2 votes

I have a list of jobs, which I am processing in parallel with OpenMP:

void processAllJobs()
{
#pragma omp parallel for
    for(int i = 0; i < n; ++i) 
        processJob(i);
}

All jobs have some sequential parts and parts that could be parallelized if called alone:

void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
#pragma omp parallel for
        for(int j = 0; j < (int)subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}

When I run processAllJobs(), threads are created for the outer loop (over the jobs) and the inner loop (over the subtasks) is executed sequentially within each thread, because nested parallelism is disabled by default. This is all fine and intended.

Sometimes there are very large jobs that take a lot of time to process - long enough that all other threads in the outer loop finish well before the last one and then sit idle. Is there a way to re-purpose the unused threads to parallelize the inner loop as soon as they are finished? I imagine something that checks the number of unused threads each time the inner parallel region is entered.

I cannot predict how long a job runs. There might not be just one long-running job - there could be two or three.

You could use dynamic scheduling in the outer loop with a small number of threads, and nested parallelism in the inner loop, also controlling the number of threads. If your total number of threads is 16, you could try num_threads(4) in both cases. With dynamic scheduling, fast threads finish early and can process several small chunks while a long job is running. With nested parallelism, you guarantee that several threads will be used for long jobs. – Alain Merigot
Try task loops for both levels. Whether or not that will be of benefit is impossible to tell from the information in the question. – Zulan
@Alain: Thanks for the suggestion. That would probably help a bit, although it somewhat shifts the problem. In the example with a single long-lasting job, if there were 15 idle threads with the initial approach, there would be 12 idle threads with the modified approach (with four working threads). And the sequential parts would not benefit from maximum parallelization. – Nico Schertler

1 Answer

3 votes

Your description of the problem sounds like OpenMP tasking would be a much better fit. Your code would then look like this:

void processAllJobs()
{
#pragma omp parallel master  // OpenMP 5.0; use separate "parallel" and "single" on older compilers
    for(int i = 0; i < n; ++i) 
#pragma omp task
        processJob(i);
}

Then the processing of the job would look like this:

void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
#pragma omp taskloop   // add grainsize() clause, if Process() is very short
        for(int j = 0; j < (int)subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}

That way you get natural load balancing (assuming that you have enough tasks) without having to rely on nested parallelism.