The parallel implementation of the QuickSort algorithm

Main features

Applications of the parallel QuickSort algorithm (with the other optimizations):

The algorithm that we use as a starting point is this:
template<class T>
void QuickSort(T items, int from, int to)
{
  T t;
  int i, k;

  if (from >= to) // there's nothing to sort
    return;

  // split the items by the pivot item
  int pivot_index = from + (to - from)/2; // we get the item in the middle as pivot
  if (pivot_index != from)
    // move the pivot item to the first position
    swap(items[pivot_index], items[from]);

  t = items + from; // t points to the pivot item
  k = from;         // k will become the final position of the pivot

  // move the items so that each one ends up on the proper side of the pivot (the first item)
  for (i = from + 1; i <= to; ++i) {
    if (items[i] < *t) {
      // the given item is smaller than the pivot item, exchange them
      ++k;
      swap(items[k], items[i]);
      }
    }

  if (from != k)
    // move the pivot item back to the place, where each left item is less than it
    swap(items[from], items[k]);

  // sort the subranges
  QuickSort(items, from, k - 1);
  QuickSort(items, k + 1, to);
}

As we can see, the algorithm calls itself recursively at the end and does nothing afterwards. The TaskDispatcher can receive new tasks at any time, even from tasks that are already being executed - this makes it possible for us to implement recursive algorithms in this framework. And since these recursive tasks are independent (they work on separate data - that is, the ranges being sorted do not overlap), they can also be executed in parallel. The parallelism parameters (number of threads) are set in the TaskDispatcher object; the client algorithm knows nothing about how these child tasks will be executed.
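The core idea - that the two subrange sorts touch disjoint ranges and can therefore run as independent parallel tasks - can also be sketched with standard C++ facilities. In the snippet below, std::async stands in for the TaskDispatcher (whose implementation is not shown in this article); note one difference: the std::async future must be joined before returning, while the TaskDispatcher tasks are fire-and-forget.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Illustrative sketch only: std::async replaces the article's TTaskDispatcher.
// Same partitioning scheme as QuickSort above: middle item as pivot, moved to
// the front, smaller items swapped before it, then the pivot swapped into place.
template <class It>
void AsyncQuickSort(It from, It to, int lower_limit)
{
  if (to - from < 2) // nothing to sort
    return;

  std::iter_swap(from + (to - from) / 2, from); // pivot to the front
  It k = from;
  for (It i = from + 1; i != to; ++i)
    if (*i < *from)
      std::iter_swap(++k, i);
  std::iter_swap(from, k); // pivot to its final position

  if (to - from >= lower_limit) {
    // big enough range: sort the left half in a separate task
    auto left = std::async(std::launch::async, AsyncQuickSort<It>, from, k, lower_limit);
    AsyncQuickSort(k + 1, to, lower_limit);
    left.wait();
    }
  else {
    // small range: plain recursion
    AsyncQuickSort(from, k, lower_limit);
    AsyncQuickSort(k + 1, to, lower_limit);
    }
}
```

Unlike the TaskDispatcher version below, this variant waits for its child task before returning, so no separate completion mechanism is needed - at the price of blocking a thread at every spawn level.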

To be able to sort in parallel, we have to define a class that is responsible for the parallel execution:
template<class T>
class TQuickSortTask: public TTaskDispatcher::TTask
{
    T Items;                    // the items to be ordered
    int From,                    // the start position within Items
        To,                      // the end         -"-
        LowerLimit;              // the task granularity: this is the lower limit where we go into normal recursion
    TTaskDispatcher &Dispatcher; // the main TaskDispatcher

  public:
    TQuickSortTask(T items, int from, int to, int ll, TTaskDispatcher &dispatcher): Items(items), From(from), To(to), LowerLimit(ll), Dispatcher(dispatcher) {}
    void Execute();              // the method doing the actual work (called by the TaskDispatcher)
};

template<class T>
void TQuickSortTask<T>::Execute()
{
  // we simply call the (parallel) sorter function (see later) with the current data
  ParallelQuickSort(Items, From, To, LowerLimit, Dispatcher);
}

As the code shows, the task objects don't do anything but store the data range to be sorted (items, start, end) and call the ParallelQuickSort function from their Execute() method (which is called by the TaskDispatcher that controls the parallel execution of the tasks).

Therefore we have to (re)write the algorithm so that it creates new tasks after the partition:
template<class T>
void ParallelQuickSort(T items, int from, int to, int lower_limit, TTaskDispatcher &dispatcher)
{
  // ... the code up to this point is the same as in QuickSort ...
  
  // sort the subranges
  if (to - from + 1 >= lower_limit) {
    // new code: multithreaded sort with big enough ranges
     dispatcher.AddTask(new TQuickSortTask<T>(items, from, k - 1, lower_limit, dispatcher), TTaskDispatcher::wmNoWait);
     dispatcher.AddTask(new TQuickSortTask<T>(items, k + 1, to, lower_limit, dispatcher), TTaskDispatcher::wmNoWait);
     }

  else {
    // old code: single-threaded, recursive subrange sort
    QuickSort(items, from, k - 1);
    QuickSort(items, k + 1, to);
    }
}

After spawning the new tasks for the less-than and greater-than ranges, we simply leave the function without waiting for anything - the current task will be released by the TaskDispatcher (just as its subtasks will be). This means that we have to catch the end of the whole sorting process at another place, right after starting the job:
  vector<string> vs;
  // fill vs up with some data

  TTaskDispatcher dispatcher(threadno, TTaskDispatcher::imUnlimited); // initialize the dispatcher
  DWORD start = GetTickCount();
  ParallelQuickSort(vs.begin(), 0, (int)vs.size() - 1, 1000, dispatcher); // start the parallel QuickSort
  dispatcher.WaitForGroup(0);                                         // wait for each task to finish
  DWORD end = GetTickCount();
  cout << "Elapsed: " << end - start << endl;

The waiting is accomplished by the task-grouping mechanism of the TaskDispatcher. In our case this is very simple: since we don't send any other job to the dispatcher, every task carries the default group id 0, and that is the group we wait for.
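The article does not show how WaitForGroup() is implemented inside the TTaskDispatcher, but the underlying completion tracking can be sketched with a counter guarded by a mutex - AddTask() increments it, every finished task decrements it, and the waiter blocks until it reaches zero. The class below is a hypothetical illustration of that idea, not the actual TaskDispatcher code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical sketch of group-completion tracking (not the real TTaskDispatcher):
// Pending counts the tasks that have been added but not yet finished.
class TaskCounter
{
    std::mutex Mutex;
    std::condition_variable Done;
    int Pending = 0;

  public:
    void TaskAdded()
    {
      std::lock_guard<std::mutex> lock(Mutex);
      ++Pending; // a task was handed to the dispatcher
    }
    void TaskFinished()
    {
      std::lock_guard<std::mutex> lock(Mutex);
      if (--Pending == 0)
        Done.notify_all(); // wake up the waiter(s) when the last task is done
    }
    void WaitForAll()
    {
      std::unique_lock<std::mutex> lock(Mutex);
      Done.wait(lock, [this] { return Pending == 0; });
    }
};
```

With such a counter per group id, a WaitForGroup(0) call simply blocks until the group's pending count drops to zero - which is why the fire-and-forget tasks above need no explicit joining.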

Measurement results with a large (~700k lines) text file, in milliseconds. The column headers are the lower limits of the recursion/thread-spawn threshold.

Threads |    10 |   20 |   50 |  100 |  200 |  500 | 1000 | 2000 | 5000 | 10000
      1 | 20732 | 5569 | 3261 | 2402 | 1856 | 1466 | 1279 | 1139 |  920 |   796
      2 | 11186 | 3073 | 1857 | 1389 | 1108 |  904 |  811 |  717 |  561 |   421
      3 |  8065 | 2309 | 1436 | 1123 |  889 |  765 |  702 |  624 |  468 |   296
      4 |  6583 | 1966 | 1248 |  951 |  811 |  687 |  624 |  624 |  421 |   249

As we can see, the bigger the lower limit, the better the results. In the best case (lower limit 10000), the 1-thread to 4-thread ratio is slightly above 3 (796 ms vs. 249 ms) - that is, roughly a threefold performance increase on 4 processors.

It is important to emphasize that this algorithm doesn't use any optimization for the sorting itself (as opposed to the TContainer class), such as:

Both of these optimizations contribute positively to the overall performance of the sorting, but this isn't the goal here.


Attila NAGY
Last updated: 2018-08-30