The parallel implementation of the QuickSort algorithm
The algorithm that we use as a starting point is this:
template<class T>
void QuickSort(T items, int from, int to)
{
    T t;
    int i, k;

    if (from >= to)  // there's nothing to sort
        return;

    // split the items by the pivot item
    int pivot_index = from + (to - from)/2;  // take the item in the middle as the pivot
    if (pivot_index != from)  // move the pivot item to the first position
        swap(items[pivot_index], items[from]);
    t = items + from;
    k = from;
    // move the items in the array to the side related to the pivot item
    for (i = from + 1; i <= to; ++i)
    {
        if (items[i] < *t)
        {
            // the given item is smaller than the pivot item, move it to the left side
            ++k;
            swap(items[k], items[i]);
        }
    }
    if (from != k)  // move the pivot item back to its final place, where every item to its left is smaller
        swap(items[from], items[k]);

    // sort the subranges
    QuickSort(items, from, k - 1);
    QuickSort(items, k + 1, to);
}
As we can see, the algorithm calls itself recursively at the end and does nothing afterwards. The TaskDispatcher can receive new tasks at any time, even from tasks that are already executing, which makes it possible to implement recursive algorithms on top of it. And since these recursive tasks are independent (they work on separate data: the ranges being sorted do not overlap), they can also be executed in parallel. The parallelism parameters (such as the number of threads) are set in the TaskDispatcher object; the client algorithm knows nothing about how its child tasks will be executed.
To be able to sort in parallel, we have to define a class that is responsible for the parallel execution:
template<class T>
class TQuickSortTask: public TTaskDispatcher::TTask
{
    T Items;                     // the items to be sorted
    int From,                    // the start position within Items
        To,                      // the end position within Items
        LowerLimit;              // the task granularity: below this limit we fall back to normal recursion
    TTaskDispatcher &Dispatcher; // the main TaskDispatcher

public:
    TQuickSortTask(T items, int from, int to, int ll, TTaskDispatcher &dispatcher):
        Items(items), From(from), To(to), LowerLimit(ll), Dispatcher(dispatcher) {}

    void Execute();              // the method doing the actual work (called by the TaskDispatcher)
};

template<class T>
void TQuickSortTask<T>::Execute()
{
    ParallelQuickSort(Items, From, To, LowerLimit, Dispatcher);
}
As the code shows, the task objects don't do anything beyond storing the data range to be sorted (items, start, end) and calling the ParallelQuickSort function from their Execute() method (which is invoked by the TaskDispatcher that controls the parallel execution of the tasks).
Therefore we have to (re)write the algorithm so that it creates new tasks after the partition:
template<class T>
void ParallelQuickSort(T items, int from, int to, int lower_limit, TTaskDispatcher &dispatcher)
{
    //... the code is the same as in QuickSort ...

    // sort the subranges
    if (to - from + 1 >= lower_limit)
    {
        // new code: multithreaded sort for big enough ranges
        dispatcher.AddTask(new TQuickSortTask<T>(items, from, k - 1, lower_limit, dispatcher));
        dispatcher.AddTask(new TQuickSortTask<T>(items, k + 1, to, lower_limit, dispatcher));
    }
    else
    {
        // small ranges are sorted with the normal recursive algorithm
        QuickSort(items, from, k - 1);
        QuickSort(items, k + 1, to);
    }
}
After spawning the new tasks for the less-than and greater-than ranges, we simply leave the function without waiting for anything; the current task will be released by the TaskDispatcher (just as its subtasks will be). This means we have to detect the end of the whole sorting process elsewhere, right after starting the job:
vector<string> vs;
// fill vs up with some data

TTaskDispatcher dispatcher(threadno, TTaskDispatcher::imUnlimited); // initialize the dispatcher

DWORD start = GetTickCount();
ParallelQuickSort(vs.begin(), 0, vs.size() - 1, 1000, dispatcher);  // call the QuickSort
dispatcher.WaitForGroup(0);                                         // wait for each task to finish
DWORD end = GetTickCount();

cout << "Elapsed: " << end - start << endl;
The waiting is accomplished by the task grouping mechanism of the TaskDispatcher. In our case this is very simple: since we don't send any other, more complex job to the dispatcher, every task carries the default group id of 0, and we can simply wait for that group.
Measurement results with a large (~700,000-line) text file, in milliseconds. The columns show the lower limit below which the algorithm falls back to plain recursion instead of spawning new tasks.
Threads \ Lower limit | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 2000 | 5000 | 10000 |
1 | 20732 | 5569 | 3261 | 2402 | 1856 | 1466 | 1279 | 1139 | 920 | 796 |
2 | 11186 | 3073 | 1857 | 1389 | 1108 | 904 | 811 | 717 | 561 | 421 |
3 | 8065 | 2309 | 1436 | 1123 | 889 | 765 | 702 | 624 | 468 | 296 |
4 | 6583 | 1966 | 1248 | 951 | 811 | 687 | 624 | 624 | 421 | 249 |
As we can see, the bigger the lower limit, the better the results. In the best case (lower limit 10000), the ratio between the slowest (1-thread) and fastest (4-thread) times is slightly above 3, i.e. four processors give roughly a threefold performance increase.
It is important to emphasize that this algorithm doesn't use any optimization for the sorting itself (as opposed to the TContainer class), such as:
Both of these optimizations contribute positively to the overall sorting performance, but that isn't the goal here.