The parallel implementation of the Fibonacci number calculation

The basic implementation is quite simple:
typedef int TCalcType; // with long long, we could prepare for _really_ big values, too

template<class T>
TCalcType Fibonacci(T n)
{
  if (n < 2)

    // the end of the recursion, with 1 as a default value
    return 1;

  // calculate the result recursively
  return Fibonacci(n - 1) + Fibonacci(n - 2);
}

Although the algorithm is very simple, calculating values above 40 can take a significant amount of time (in the range of seconds - about 2 secs on the test machine with v=40).

To make the algorithm parallel, we have to extend it so that it spawns new tasks to calculate the subvalues.
The lower limit decides whether it is really worth going parallel into the deeper layers: if this value is too low, the subcalculations become so fast that they are comparable with the time needed for the task coordination, and the overall result deteriorates (significantly). On an average machine, this value is around 30.
template<class T>
T ParallelFibonacci(T n, int lowlim, TTaskDispatcher &dispatcher)
{
  if (n < 2)
    // the end of the recursion
    return 1;

  if (n < lowlim)
    // the lower limit has been reached, we go on with simple recursion
    return ParallelFibonacci(n - 1, lowlim, dispatcher) + ParallelFibonacci(n - 2, lowlim, dispatcher);
  else {
    // call parallel tasks with the subvalues (only start them)
    dispatcher.AddTask(new TFibonacciTask<T>(n - 1, lowlim, dispatcher), TTaskDispatcher::wmNoWait);
    dispatcher.AddTask(new TFibonacciTask<T>(n - 2, lowlim, dispatcher), TTaskDispatcher::wmNoWait);
    }

  return 0;
}

As we only start new tasks to calculate the subvalues, and the current task is released after this point, we need to sum the partial results of the tasks - to achieve that, we first store these values in the task objects:
template<class T>
class TFibonacciTask: public TTaskDispatcher::TTask
{
    T Value;                     // the value of the current level - the current task calculates the Fibonacci-number of this value
    int LowLim;                  // the general lower limit of the task spawning
    TTaskDispatcher &Dispatcher; // the dispatcher that controls the whole parallel processing

  public:
    T Result;                    // the result that belongs to Value

    TFibonacciTask(T value, int lowlim, TTaskDispatcher &dispatcher): Value(value), LowLim(lowlim), Dispatcher(dispatcher) {}
    void Execute();
};

template<class T>
void TFibonacciTask<T>::Execute()
{
  // the task doesn't do anything but calls the ParallelFibonacci to calculate the result
  Result = ParallelFibonacci(Value, LowLim, Dispatcher);
}

As the final result is the sum of the partial (task) results (see the original function that adds the results of the recursive calls), we have to sum the values of the tasks - we can do this in our specialized dispatcher:
template<class T>
class TFibonacciDispatcher: public TTaskDispatcher
{
  public:
    T Sum;                    // the final result

    TFibonacciDispatcher(int threadno): TTaskDispatcher(threadno, imUnlimited) {
      Sum = 0;
      }

    bool DeleteTask(TTask *); // we can override this method to have access to the partial (task) result (and add it to the final sum)
};

template<class T>
bool TFibonacciDispatcher<T>::DeleteTask(TTask *task)
{
  // before we release the task object, we add its result to the sum
  Sum += ((TFibonacciTask<T> *)task)->Result;

  // we have to call the original method to release the object
  return TTaskDispatcher::DeleteTask(task);
}

Now it's easy to call the parallel algorithm:
      TFibonacciDispatcher<TCalcType> dispatcher(threadno);    // this is our specialized dispatcher

      start = GetTickCount();
      ParallelFibonacci(value, lowlim, dispatcher); // call the function
      dispatcher.WaitForGroup(0);                              // wait for the tasks to finish
      end = GetTickCount();
      f = dispatcher.Sum;                                      // dispatcher.Sum contains the final result

Measurement results with the original, recursive algorithm:
Value         30    32    34    36    38     40     42      44      46
Time [msec]   16    32   109   297   765   2012   5258   13775   36036

As we can see, we reach the limit of measurability around 30 (16 msec is the lower limit on Windows when we use the GetTickCount() function) and the times start to increase exponentially from that point.

As it is generally the case with parallel calculations, we have to specify a lower limit where the algorithm switches to normal recursion instead of spawning new tasks - these times suggest that a lower limit around 30 is practical for our purposes.

Multithreaded measurement results (with the lower limit set to 30, therefore only values above it are calculated in parallel):
Threads    36    38     40     42      44      46
1         281   765   2013   5257   13759   36036
2         188   421   1030   2652    6927   18111
3         125   281    702   1794    4633   12137
4          94   218    546   1389    3572    9313

As we can see, with this parameterization the performance increases nearly linearly with the number of threads.

Important: we still need to choose the value of the lower limit carefully, because a wrong limit easily deteriorates - sometimes disastrously - the overall results.
For example (with 4 threads):
Lower limit    30     25     20      15
Time [msec]  1357   1435   2216   11388

As the table shows, our previous estimation of 30 for the lower limit was good: when we decrease this value, the calculation time starts to increase.
What is really bad: this performance deterioration has an exponential characteristic, in accordance with the nature of the original algorithm.

Download the test application (the parameterization can be seen by calling with -?)

The performance tests were run with the following command lines:
agen -v "a=[30,32,34,36,38,40,42,44,46]" $(a) | axargs fibonacci.exe -v $(FILE)

agen -v "a=[36,38,40,42,44,46]" -v "b=[1,2,3,4]" $(a) $(b) | axargs -i 2 -a fibonacci.exe -l 30 -v $(FILE1) -j $(FILE2)

agen -v "a=[30,25,20,15]" $(a) | axargs -a fibonacci.exe -v 42 -l $(FILE) -j 4


Attila NAGY Last updated: 2018-08-30