Parallel programming - a task-based approach
With the increasing availability of multicore processors, there is a natural need to make use of the new computing resources.
There are two very different modes of handling this situation, and the two are architecturally different:
- Concurrency: here the system typically runs several independent processes (threads), and these processes access common structures in parallel.
In this case we have to handle the situation when two parallel actors want to modify the same data (often a complex data structure and its fields).
The implementation typically achieves this by applying mutexes, so that only one actor at a time is allowed to write a given memory area.
These systems can really make use of multiprocessing only if the computational activity is relatively low - in this case the collisions are relatively rare, and therefore
the synchronization activity does not crowd out the real, useful processor work. This holds only as long as the parallel processes spend most of their time idle, waiting for something
- typically an input trigger - and are busy for no longer than, say, a few percent of the time. The final performance depends on several factors, most heavily on the inner (dynamic) structure of the program.
The more complex it is, the more probable that the system approaches the performance of a single-processor machine. Theoretically this performance decrease would be asymptotic, but in practice - due to
the synchronization overhead, which costs a constant amount of time per operation - the final performance can even drop below this limit.
This architecture is usually applied in embedded systems.
- Parallel (multithreaded) processing: this is the case when we can identify places in the program where a relatively big amount of data has to be processed, and this dataset can be divided up so that
the chunks can be processed independently. The processing can be something like a conversion of the data or the calculation of some properties of it. In the second case there is often
a need for a (non-parallelizable) postprocessing step (where we have to accumulate the subresults).
Depending on the actual implementation, we have to take into account that this mode is influenced by constant factors, too: the more threads are started, the more time the
operating system needs to manage these threads (especially if the number of threads is significantly greater than that of the physical processors). If we use the task-based approach (described in detail later),
we have to consider that if the task execution time is short (comparable to the time of the task management), the constant factors will eat up the time, resulting in a longer-than-ideal execution. Therefore these cases always need
a preliminary estimation to specify the proper amount of data to be processed by a single task.
Important: the previous factors influence only the parallelizable parts of the program. The overall performance is better described by Amdahl's law (where
a sequential part can be, for example, the postprocessing step mentioned earlier).
This architecture is usually applied in data-processing applications (sound, image, or text processing).
The following part concentrates on the task-based approach, which has some similarities with real life:
- The number of executors - in our case, the processors - is limited, and one executor deals with only one task at a time (and doesn't know anything about the following tasks).
Real life: a bank or a store with cash desks that serve the customers.
- If there is a task to serve - real life: a customer wants to issue a money transfer - and there is a free executor, the latter has to handle it.
- The tasks come in unpredictably: the structure has no information about how the tasks arrive (how many, how frequently). Real life: there are more customers in the stores in the late afternoon.
- A new task is not allowed to select between the executors (even if it has some assumption about how long the executors will work on their current tasks).
Why are these similarities important? Because sociological studies have pointed out that the best overall performance is achieved only if the previous conditions hold. This is why you have to draw
a number when you go to a bank or a service desk, and this is why supermarket checkout lanes are so unreliable and always make you think you have taken the wrong lane (there the customers decide which lane they choose).
A good task-based parallel framework does the same: it sorts the incoming tasks into a waiting queue and forwards them, in this order, to the free execution threads. If a task comes in and there is no free
thread (processor), it waits in the queue until one of them finishes.
Advantages:
- Efficiency: as the sociological studies analyzed, with many tasks this gives the best overall result.
- The system is not overloaded: in the typical configuration the number of worker threads is not greater than the number of physical processors. In this case, the
operating system doesn't have to spend much effort on thread management.
- Processor utilization is maximal: as long as there are tasks to serve, no processor is idle.
- With a proper implementation the tasks are allowed to create new tasks. This makes it easy to implement recursive
algorithms: we don't have to change the original nature of the algorithm (recursion is very expressive), yet we can make use of the multiprocessor resources.
- Good testability: parallel programs are by nature very difficult to test (two runs of the program are never the same, at least not in terms of thread scheduling). As the executor structure can be configured
with the number of threads, testing can always start with a single thread. In most cases this is enough to discover the eventual algorithmic flaws. While increasing the
number of threads we can check the performance change, too (this is extremely important, because the slightest mistake that violates task independence can cause
serious performance problems). After all, we parallelize programs because we want better performance, don't we?
- As the tasks are stored in a queue, we can specify a limit that acts as a system overload threshold (too many tasks waiting for service). The good thing is that we can react to this
situation.
- This structure can be used in concurrent programs, too. When the threads have to do a calculation and start it as a task, we get better processor utilization (and less thread synchronization).
What's more, if the structure is able to handle relationships between the tasks, we can avoid the data collisions, too (or at least most of them).
Disadvantages, things to consider:
- Proper task programming always requires some preliminary assumption about the task execution time - let's call it task granularity. If this time is too short,
the task management overhead will prevail over the useful (calculation) time. If this time is too long, the overall time may be longer than ideal, especially if the
relative task execution times differ greatly (uneven or irregular granularity). With overly big tasks we approach simple thread-based operation.
And this time can change from (actual physical) machine to machine.
As practice shows, we can fortunately specify general lower limits for specific kinds of tasks.
- This approach (unfortunately) doesn't alleviate the problem of false sharing.
The approach described above is implemented in the TTaskDispatcher class.