3
votes

Consider the following scenario: I am writing a function that contains a computationally intensive loop, which I parallelized with TBB's parallel_for. The problem is that this function may be used on its own, in which case it benefits from the parallelization, or it may be called from within another loop. In the latter case, the outer loop can also be parallelized, and often it is better to parallelize only the outer loop.

Normally, parallelizing both the outer and the inner loop is not a problem in TBB since, unlike in OpenMP, nested parallelism in TBB does not result in additional threads being created; TBB merely creates more tasks. However, sometimes the overhead of creating more tasks in the inner loop is still undesirable (I observed a 40% slowdown in one extreme situation).

So is there a way to make TBB not create any tasks when parallel_for etc. is invoked while another parallel_for algorithm is executing, similar to the effect of OMP_NESTED=FALSE in OpenMP?

1
I've added some paragraph breaks so that this isn't just a "wall of text". – Damien_The_Unbeliever

1 Answer

2
votes

Simple answer: No

Simple advice: Don't use simple_partitioner

There is no way to affect parallel_for or the other algorithms from outside or from the outer level, except by restricting their concurrency via task_scheduler_init or task_arena. However, those are not well suited to nested parallelism in any case.

In any case, there should not be such a big impact on performance if auto_partitioner is used (especially on the nested level) and you follow the TBB recommendations on the amount of work that is efficient to parallelize.

Though I admit that in extreme cases it can be a problem. We (the TBB developers) have thought about optimizing the auto-partitioning parameters of parallel_for depending on the context in which it is executed. But the issue is that knowing whether or not we are on a nested level is not enough to reliably choose the parameters. E.g., consider a parallel_for launched from a single task: formally it is nested, but there is no parallelism on the outer level. Some parts of the task scheduler would need to be significantly reworked to provide information about the number of busy workers at any given time in order to enable this idea.
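Since the scheduler cannot tell you this, one user-side workaround is to let the caller say it explicitly: the outer loop body sets a thread-local marker, and the inner function falls back to a plain serial loop whenever the marker is set. The sketch below is not a TBB feature. `inside_outer_loop`, `OuterLoopScope` and `for_each_index` are hypothetical names, and the parallel backend is passed in as a callable so the snippet itself depends only on the standard library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of TBB): per-thread marker meaning
// "this thread is currently running the body of an outer parallel loop".
inline bool& inside_outer_loop() {
    thread_local bool flag = false;
    return flag;
}

// RAII guard: sets the marker on construction and restores the
// previous value on destruction, so nested scopes unwind correctly.
class OuterLoopScope {
public:
    OuterLoopScope() : previous_(inside_outer_loop()) {
        inside_outer_loop() = true;
    }
    ~OuterLoopScope() {
        assert(inside_outer_loop());  // scopes must be destroyed in order
        inside_outer_loop() = previous_;
    }
    OuterLoopScope(const OuterLoopScope&) = delete;
    OuterLoopScope& operator=(const OuterLoopScope&) = delete;

private:
    bool previous_;
};

// Runs body(i) for every i in [0, n).
// If the calling thread is inside an OuterLoopScope, the loop runs
// serially and the backend is never invoked, so no tasks are created.
// Otherwise the work is handed to parallel_backend(n, body).
template <class Backend, class Body>
void for_each_index(std::size_t n, Backend&& parallel_backend, const Body& body) {
    if (inside_outer_loop()) {
        for (std::size_t i = 0; i < n; ++i) {
            body(i);
        }
    } else {
        parallel_backend(n, body);
    }
}
```

With TBB, the backend could be a lambda that forwards to `tbb::parallel_for(std::size_t(0), n, body)`, and the body of the outer parallel_for would create an `OuterLoopScope` as its first statement. The marker is per thread, so it has to be set inside the outer body (on whichever thread runs it), not before the outer loop is launched. Because the opt-out is explicit, a parallel_for launched from a single task simply does not set the marker and keeps its inner parallelism.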