I have a list of jobs, which I am processing in parallel with OpenMP:
void processAllJobs()
{
    #pragma omp parallel for
    for(int i = 0; i < n; ++i)
        processJob(i);
}
All jobs have some sequential parts and parts that could be parallelized if called alone:
void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
        // Parallel only when processJob() is called outside another parallel region
        #pragma omp parallel for
        for(int j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}
When I run processAllJobs(), threads are created for the outer loop (over the jobs), and the inner loop (over the subtasks) runs sequentially within each thread. This is all fine and intended.
Sometimes there are very large jobs that take a long time to process, long enough that all other threads in the outer loop finish well before the last one and then sit idle. Is there a way to repurpose the idle threads to parallelize the inner loop as soon as they become free? I imagine something that checks the number of unused threads each time the inner parallel region is entered.
I cannot predict how long a job runs. It might not only be one long-lasting job - maybe there are two or three.
Use dynamic scheduling in the outer loop with a small number of threads, and use nested parallelism in the inner loop, also controlling the number of threads. If your total number of threads is 16, you can try num_threads(4) in both cases. With dynamic scheduling, fast threads finish early, so several small chunks can be processed while a long job is still running. With nested parallelism, you guarantee that several threads will be used for long jobs. – Alain Merigot