The parallel implementation of the Fibonacci number calculation
The basic implementation is quite simple:
typedef int TCalcType; // with long long, we could prepare for _really_ big values, too

template<class T>
TCalcType Fibonacci(T n)
{
    if (n < 2) // the end of the recursion, with 1 as a default value
        return 1;
    // calculate the result recursively
    return Fibonacci(n - 1) + Fibonacci(n - 2);
}
Although the algorithm is very simple, calculating values above 40 can take a significant amount of time, in the range of seconds (about 2 seconds on the test machine with v=40).
To make the algorithm parallel, we have to extend it so that it spawns new tasks to calculate the subvalues.
The lower limit is important to decide whether it is really worth going parallel into the deeper layers: if this value is too low, the subcalculations become so fast that their cost is comparable with the time needed for task coordination, and the overall result deteriorates (significantly). On an average machine, this value is around 30.
template<class T>
T ParallelFibonacci(T n, int lowlim, TTaskDispatcher &dispatcher)
{
    if (n < 2) // the end of the recursion
        return 1;
    if (n < lowlim)
        // the lower limit has been reached, we go on with simple recursion
        return ParallelFibonacci(n - 1, lowlim, dispatcher) +
               ParallelFibonacci(n - 2, lowlim, dispatcher);
    else
    {
        // call parallel tasks with the subvalues (only start them)
        // (the listing is truncated here in the source; the spawning calls
        // below are reconstructed from the surrounding description)
        dispatcher.AddTask(new TFibonacciTask<T>(n - 1, lowlim, dispatcher));
        dispatcher.AddTask(new TFibonacciTask<T>(n - 2, lowlim, dispatcher));
        return 0; // the partial results are summed by the dispatcher
    }
}
As we only start new tasks to calculate the subvalues, and the current task is released after this point, we need to sum the partial results of the tasks. To achieve that, we first need to store these values in the task objects:
template<class T>
class TFibonacciTask: public TTaskDispatcher::TTask
{
    T Value;                     // the current task calculates the Fibonacci number of this value
    int LowLim;                  // the general lower limit of the task spawning
    TTaskDispatcher &Dispatcher; // the dispatcher that controls the whole parallel processing
public:
    T Result;                    // the result that belongs to Value

    TFibonacciTask(T value, int lowlim, TTaskDispatcher &dispatcher):
        Value(value), LowLim(lowlim), Dispatcher(dispatcher) {}

    void Execute();
};

template<class T>
void TFibonacciTask<T>::Execute()
{
    // the task doesn't do anything but calls ParallelFibonacci to calculate the result
    Result = ParallelFibonacci(Value, LowLim, Dispatcher);
}
As the final result is the sum of the partial (task) results (see the original function, which adds the results of the recursive calls), we have to sum the values of the tasks. We can do this in our specialized dispatcher:
template<class T>
class TFibonacciDispatcher: public TTaskDispatcher
{
public:
    T Sum; // the final result

    TFibonacciDispatcher(int threadno): TTaskDispatcher(threadno, imUnlimited)
    {
        Sum = 0;
    }

    // we can override this method to have access to the partial (task) result
    // (and add it to the final sum)
    bool DeleteTask(TTask *);
};

template<class T>
bool TFibonacciDispatcher<T>::DeleteTask(TTask *task)
{
    // (the listing is truncated here in the source; the body below is
    // reconstructed from the description above)
    Sum += static_cast<TFibonacciTask<T> *>(task)->Result;
    return TTaskDispatcher::DeleteTask(task);
}
Now it's easy to call the parallel algorithm:
TFibonacciDispatcher<TCalcType> dispatcher(threadno); // this is our specialized dispatcher
start = GetTickCount();
// (the listing is truncated here in the source; a call like the following
// presumably starts the calculation)
ParallelFibonacci(value, lowlim, dispatcher);
Measurement results with the original, recursive algorithm:
Value       |  30 |  32 |  34 |  36 |  38 |   40 |   42 |    44 |    46
Time [msec] |  16 |  32 | 109 | 297 | 765 | 2012 | 5258 | 13775 | 36036
As we can see, we reach the limit of measurability around 30 (16 msec is roughly the resolution of GetTickCount() on Windows), and the times increase exponentially from that point.
As is generally the case with parallel calculations, we have to specify a lower limit at which the algorithm switches to normal recursion instead of spawning new tasks; these times suggest that a lower limit around 30 is practical for our purposes.
Multithreaded measurement results (with the lower limit being 32, therefore we calculate Fibonacci values only above this):
Threads \ Value |  36 |  38 |   40 |   42 |    44 |    46
1               | 281 | 765 | 2013 | 5257 | 13759 | 36036
2               | 188 | 421 | 1030 | 2652 |  6927 | 18111
3               | 125 | 281 |  702 | 1794 |  4633 | 12137
4               |  94 | 218 |  546 | 1389 |  3572 |  9313
As we can see, with this parameterization the performance increases nearly linearly with the number of threads.
Important: we still need to choose the value of the lower limit carefully, because a wrong limit easily deteriorates the overall results, sometimes disastrously.
For example (with 4 threads):
Lower limit |   30 |   25 |   20 |    15
Time [msec] | 1357 | 1435 | 2216 | 11388
As the table shows, our previous estimation of 30 for the lower limit was good: when we decrease this value, the calculation time starts to increase.
What is really bad: this performance deterioration has an exponential characteristic, in accordance with the nature of the original algorithm.
Download the test application (the parameterization can be seen by calling with -?)
The performance tests were made with the following command lines:
agen -v "a=[30,32,34,36,38,40,42,44,46]" $(a) | axargs fibonacci.exe -v $(FILE)
agen -v "a=[36,38,40,42,44,46]" -v "b=[1,2,3,4]" $(a) $(b) | axargs -i 2 -a fibonacci.exe -l 30 -v $(FILE1) -j $(FILE2)
agen -v "a=[30,25,20,15]" $(a) | axargs -a fibonacci.exe -v 42 -l $(FILE) -j 4
Attila NAGY | Last updated: 2018-08-30