
Prerequisites:

  • parallel engine: OpenMP 3.1+ (can be OpenMP 4.0 if needed)
  • parallel constructs: OpenMP tasks
  • compiler: gcc 4.9.x (supports OpenMP 4.0)

Input:

  • C code with loops
  • the loop has a cross-iteration data dependency: iteration “i+1” needs data produced by iteration “i” (only this kind of dependency, nothing else)
  • the loop body may be only partially dependent
  • the loop cannot be split into two loops; the loop body must remain in one piece
  • anything reasonable may be added to the loop or to the loop body function definition

Code sample:

(Here the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)

void loopFunc(const char* config, int* value)
{
    int conf;
    conf = prepare(config);         // independent, does not change "config"
    *value = process(conf, *value); // dependent, takes prev. value, produces next
    return;
}

int main()
{
    int N = 100;
    char* configData;           // never changes
    int valueData = 0;          // initial value
    …
    for (int i = 0; i < N; i++)
    {
        loopFunc(configData, &valueData);
    }
    …
}

Need to:

  • parallelise the loop using OpenMP tasks (omp for / omp sections cannot be used)
  • “prepare” calls should execute in parallel with other “prepare” or “process” calls
  • “process” calls should be ordered according to the data dependency

What has been proposed and implemented:

  • define an integer flag
  • assign it the number of the first iteration
  • each iteration that needs data waits for the flag to equal its own iteration number
  • update the flag when the data for the next iteration is ready

Like this:

(A reminder: the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)

void loopFunc(const char* config, int* value, volatile int *parSync, int iteration)
{
    int conf;
    conf = prepare(config);         // independent, does not change "config"
    while (*parSync != iteration)   // wait for previous to be ready
    {
        #pragma omp taskyield
    }
    *value = process(conf, *value); // dependent, takes prev., produce next
    *parSync = iteration + 1;       // inform next about readiness
    return;
}

int main()
{
    int N = 100;
    char* configData;           // never changes
    int valueData = 0;          // initial value
    volatile int parallelSync = 0;
    …
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++)
    {
        #pragma omp task shared(configData, valueData, parallelSync) firstprivate(i)
            loopFunc(configData, &valueData, &parallelSync, i);
    }
    #pragma omp taskwait
    …
}

What happened:

It fails. :)

The reason is that an OpenMP task occupies an OpenMP thread until it completes. For example, suppose we define 5 OpenMP threads (as in the code above):

  • The “for” loop generates 100 tasks.
  • The OpenMP runtime assigns 5 arbitrary tasks to the 5 threads and starts them.

If the task with i=0 happens not to be among the started tasks (which occurs from time to time), the running tasks spin-wait forever, occupy their threads forever, and the task with i=0 is never started.

What's next?

I have no other ideas for how to implement the required mode of computation.

Current solution

Thanks to @parallelgeek below for the idea.

int main()
{
    int N = 10;
    char* configData;           // never changes
    int valueData = 0;          // initial value
    volatile int parallelSync = 0;
    int workers;
    volatile int workingTasks = 0;
    ...
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    {
        workers = omp_get_num_threads()-1;  // reserve 1 thread for task generation

        for (int i = 0; i < N; i++)
        {
            while (workingTasks >= workers)
            {
                #pragma omp taskyield
            }

            #pragma omp atomic update
                workingTasks++;

            #pragma omp task shared(configData, valueData, parallelSync, workingTasks) firstprivate(i)
            {
                loopFunc(configData, &valueData, &parallelSync, i);

                #pragma omp atomic update
                    workingTasks--;
            }
        }
        #pragma omp taskwait
    }
}
Let's see: you declare char* configData;, then you pass it to loopFunc(), and inside loopFunc() you pass it to process(conf, *value). Q: what exactly is (does) process()? – Michi
@Michi: Here the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables. Updated the post. “prepare” illustrates a value-independent function which takes time but does not interact with other iterations. “process” illustrates a function that takes data from the previous iteration and produces a new value for the next iteration. – Badiboy
@Gilles: "int &xxx", AFAIK, can be used only in C++, and I need to stay C-compatible... Thus "flush" seems to be impossible at all, as OMP does not allow flushing dereferenced values like flush(*parSync)... – Badiboy
Another thing that bugs me is that the standard says "Whenever a thread reaches a task scheduling point, the implementation may cause it to perform a task switch, beginning or resuming execution of a different task bound to the current team". No obligations here, so your code's behaviour is implementation defined anyway, isn't it? But just in case, I would also play with OMP_WAIT_POLICY and set it to PASSIVE, in case that allows for better task scheduling... – Gilles
@Gilles: yes, the behaviour of the solution is unpredictable, and this is the main problem. That's why I raised the question here. – Badiboy

1 Answer

  1. AFAIK, volatile does not prevent hardware reordering; that is why you could end up with a mess in memory, because the data is not written yet while the flag is already seen by the consuming thread as set.
  2. Hence a little piece of advice: use C11 atomics instead, in order to ensure visibility of the data. As far as I can see, gcc 4.9 supports C11: C11 status in GCC.
  3. You could try to divide the generated tasks into groups of K tasks, where K == ThreadNum, and generate each subsequent task (after the tasks of the first group are generated) only once one of the running tasks has finished. That gives you the invariant that at any time only K tasks are running, scheduled on the K threads.
  4. Inter-task dependencies could also be handled with atomic flags from C11.