The parallel implementation of the Fibonacci number calculation

The basic implementation is quite simple:
typedef int TCalcType; // with long long, we could prepare for _really_ big values, too

template<class T>
TCalcType Fibonacci(T n)
{
  if (n < 2)

    // the end of the recursion, with 1 as a default value
    return 1;

  // calculate the result recursively
  return Fibonacci(n - 1) + Fibonacci(n - 2);
}

Although the algorithm is very simple, calculating values above 40 can take a significant amount of time (in the range of seconds - about 2 secs on the test machine with v=40).

To make the algorithm parallel, we have to extend it so that it spawns new tasks to calculate the subvalues.
The lower limit decides whether it is really worth going parallel into the deeper layers: if this value is too low, the subcalculations become so fast that they are comparable with the time needed for the task coordination, and the overall result deteriorates (significantly). On an average machine, this value is around 30.
template<class T>
T ParallelFibonacci(T n, int lowlim, TTaskDispatcher &dispatcher)
{
  if (n < 2)
    // the end of the recursion
    return 1;

  if (n < lowlim)
    // the lower limit has been reached, we go on with simple recursion
    return ParallelFibonacci(n - 1, lowlim, dispatcher) + ParallelFibonacci(n - 2, lowlim, dispatcher);
  else {
    // call parallel tasks with the subvalues (only start them)
    dispatcher.AddTask(new TFibonacciTask<T>(n - 1, lowlim, dispatcher), TTaskDispatcher::wmNoWait);
    dispatcher.AddTask(new TFibonacciTask<T>(n - 2, lowlim, dispatcher), TTaskDispatcher::wmNoWait);
    }

  return 0;
}

As we only start new tasks to calculate the subvalues, and the current task is released after this point, we need to sum the partial results of the tasks - to achieve that, we first store these values in the task objects:
template<class T>
class TFibonacciTask: public TTaskDispatcher::TTask
{
    T Value;                     // the value of the current level - the current task calculates the Fibonacci-number of this value
    int LowLim;                  // the general lower limit of the task spawning
    TTaskDispatcher &Dispatcher; // the dispatcher that controls the whole parallel processing

  public:
    T Result;                    // the result that belongs to Value

    TFibonacciTask(T value, int lowlim, TTaskDispatcher &dispatcher): Value(value), LowLim(lowlim), Dispatcher(dispatcher) {}
    void Execute();
};

template<class T>
void TFibonacciTask<T>::Execute()
{
  // the task doesn't do anything but calls the ParallelFibonacci to calculate the result
  Result = ParallelFibonacci(Value, LowLim, Dispatcher);
}

As the final result is the sum of the partial (task) results (see the original function that adds the results of the recursive calls), we have to sum the values of the tasks - we can do this in our specialized dispatcher:
template<class T>
class TFibonacciDispatcher: public TTaskDispatcher
{
  public:
    T Sum;                    // the final result

    TFibonacciDispatcher(int threadno): TTaskDispatcher(threadno, imUnlimited) {
      Sum = 0;
      }

    bool DeleteTask(TTask *); // we can override this method to have access to the partial (task) result (and add it to the final sum)
};

template<class T>
bool TFibonacciDispatcher<T>::DeleteTask(TTask *task)
{
  // before we release the task object, we add its result to the sum
  Sum += ((TFibonacciTask<T> *)task)->Result;

  // we have to call the original method to release the object
  return TTaskDispatcher::DeleteTask(task);
}

Now it's easy to call the parallel algorithm:
      TFibonacciDispatcher<TCalcType> dispatcher(threadno);    // this is our specialized dispatcher

      start = GetTickCount();
      ParallelFibonacci(value, lowlim, dispatcher); // call the function
      dispatcher.WaitForGroup(0);                              // wait for the tasks to finish
      end = GetTickCount();
      f = dispatcher.Sum;                                      // dispatcher.Sum contains the final result

Measurement results with the original, recursive algorithm:
Value         30    32    34    36    38     40     42      44      46
Time [msec]   16    32   109   297   765   2012   5258   13775   36036

As we can see, we reach the limit of measurability around 30 (16 msec is the lower limit on Windows when we use the GetTickCount() function) and the times start to increase exponentially from that point.

As it is generally the case with parallel calculations, we have to specify a lower limit where the algorithm switches to normal recursion instead of spawning new tasks - these times suggest that a lower limit around 30 is practical for our purposes.

Multithreaded measurement results (with the lower limit set to 30, therefore only values above it are calculated in parallel):
Threads    36    38     40     42      44      46
1         281   765   2013   5257   13759   36036
2         188   421   1030   2652    6927   18111
3         125   281    702   1794    4633   12137
4          94   218    546   1389    3572    9313

As we can see, with this parameterization the performance increases nearly linearly with the number of threads.

Important: we still need to choose the value of the lower limit carefully, because a wrong limit easily deteriorates - sometimes disastrously - the overall results.
For example (with 4 threads):
Lower limit    30     25     20      15
Time [msec]  1357   1435   2216   11388

As the table shows, our previous estimation of 30 for the lower limit was good: when we decrease this value, the calculation time starts to increase.
What is really bad: this performance deterioration has an exponential characteristic, in accordance with the nature of the original algorithm.

Download the test application (the parameterization can be seen by calling with -?)

The performance tests were run with the following command lines:
agen -v "a=[30,32,34,36,38,40,42,44,46]" $(a) | axargs fibonacci.exe -v $(FILE)

agen -v "a=[36,38,40,42,44,46]" -v "b=[1,2,3,4]" $(a) $(b) | axargs -i 2 -a fibonacci.exe -l 30 -v $(FILE1) -j $(FILE2)

agen -v "a=[30,25,20,15]" $(a) | axargs -a fibonacci.exe -v 42 -l $(FILE) -j 4


Attila NAGY Last updated: 2018-08-30