The parallel implementation of the QuickSort algorithm
The algorithm that we use as a starting point is this:
template<class T>
void QuickSort(T items, int from, int to)
{
    T t;
    int i, k;

    if (from >= to)  // there's nothing to sort
        return;

    // split the items by the pivot item
    int pivot_index = from + (to - from)/2;  // take the item in the middle as the pivot
    if (pivot_index != from)  // move the pivot item to the first position
        swap(items[pivot_index], items[from]);
    t = items + from;
    k = from;
    // move the items in the array to the side related to the pivot item
    for (i = from + 1; i <= to; ++i)
    {
        if (items[i] < *t)
        {
            // the given item is smaller than the pivot item, move it to the left side
            ++k;
            swap(items[k], items[i]);
        }
    }
    if (from != k)  // move the pivot item back to its final place, where every item to its left is smaller
        swap(items[from], items[k]);

    // sort the subranges
    QuickSort(items, from, k - 1);
    QuickSort(items, k + 1, to);
}
As we can see, the algorithm calls itself recursively at the end and does nothing afterwards. The TaskDispatcher can receive new tasks at any time, even from tasks that are already executing, which makes it possible to implement recursive algorithms on top of it. And since these recursive tasks are independent (they work on separate data: the ranges being sorted do not overlap), they can also be executed in parallel. The parallelism parameters (such as the number of threads) are set in the TaskDispatcher object; the client algorithm knows nothing about how its child tasks will be executed.
To be able to sort in parallel, we have to define a class that is responsible for the parallel execution:
template<class T>
class TQuickSortTask: public TTaskDispatcher::TTask
{
    T Items;                     // the items to be sorted
    int From,                    // the start position within Items
        To,                      // the end position within Items
        LowerLimit;              // the task granularity: below this limit we fall back to normal recursion
    TTaskDispatcher &Dispatcher; // the main TaskDispatcher

public:
    TQuickSortTask(T items, int from, int to, int ll, TTaskDispatcher &dispatcher):
        Items(items), From(from), To(to), LowerLimit(ll), Dispatcher(dispatcher) {}

    void Execute();              // the method doing the actual work (called by the TaskDispatcher)
};

template<class T>
void TQuickSortTask<T>::Execute()
{
    ParallelQuickSort(Items, From, To, LowerLimit, Dispatcher);
}
As the code shows, the task objects don't do anything beyond storing the data range to be sorted (items, start, end) and calling the ParallelQuickSort function from their Execute() method (which is invoked by the TaskDispatcher that controls the parallel execution of the tasks).
Therefore we have to (re)write the algorithm so that it creates new tasks after the partition:
template<class T>
void ParallelQuickSort(T items, int from, int to, int lower_limit, TTaskDispatcher &dispatcher)
{
    //... the code is the same as in QuickSort ...

    // sort the subranges
    if (to - from + 1 >= lower_limit)
    {
        // new code: multithreaded sort for big enough ranges
        dispatcher.AddTask(new TQuickSortTask<T>(items, from, k - 1, lower_limit, dispatcher));
        dispatcher.AddTask(new TQuickSortTask<T>(items, k + 1, to, lower_limit, dispatcher));
    }
    else
    {
        // small ranges are sorted with the normal recursive algorithm
        QuickSort(items, from, k - 1);
        QuickSort(items, k + 1, to);
    }
}
After spawning the new tasks for the less-than and greater-than ranges, we simply leave the function without waiting for anything; the current task will be released by the TaskDispatcher (just as its subtasks will be). This means we have to detect the end of the whole sorting process elsewhere, right after starting the job:
vector<string> vs;
// fill vs up with some data

TTaskDispatcher dispatcher(threadno, TTaskDispatcher::imUnlimited); // initialize the dispatcher

DWORD start = GetTickCount();
ParallelQuickSort(vs.begin(), 0, vs.size() - 1, 1000, dispatcher);  // call the QuickSort
dispatcher.WaitForGroup(0);                                         // wait for each task to finish
DWORD end = GetTickCount();

cout << "Elapsed: " << end - start << endl;
The waiting is accomplished by the task grouping mechanism of the TaskDispatcher. In our case this is very simple: since we don't send any other, more complex job to the dispatcher, every task carries the default group id of 0, and we can simply wait for that group.
Measurement results with a large (~700,000-line) text file, in milliseconds. The columns show the lower limit below which the algorithm falls back to plain recursion instead of spawning new tasks.
Threads \ Lower limit | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 2000 | 5000 | 10000 |
1 | 20732 | 5569 | 3261 | 2402 | 1856 | 1466 | 1279 | 1139 | 920 | 796 |
2 | 11186 | 3073 | 1857 | 1389 | 1108 | 904 | 811 | 717 | 561 | 421 |
3 | 8065 | 2309 | 1436 | 1123 | 889 | 765 | 702 | 624 | 468 | 296 |
4 | 6583 | 1966 | 1248 | 951 | 811 | 687 | 624 | 624 | 421 | 249 |
As we can see, the bigger the lower limit, the better the results. In the best case (lower limit 10000), the ratio between the slowest (1-thread) and fastest (4-thread) times is slightly above 3, i.e. four processors give roughly a threefold performance increase.
It is important to emphasize that this algorithm doesn't use any optimization for the sorting itself (as opposed to the TContainer class), such as:
Both of these optimizations contribute positively to the overall sorting performance, but that isn't the goal here.