
I'm trying to parallelize some code. It looks like this:

    #pragma omp parallel private(i,j,k)
    #pragma omp parallel for shared(A)
    for(k=0;k<100;k++)
     for(i=1;i<1024;i++)
      for(j=0;j<1024;j++)
       A[i][j+1]=<< some expression involving elements of A[i-1][j-1] >>

On executing this code I get a result that differs from the serial execution of the loops, and I can't see what I'm doing wrong.

I've also tried the collapse() clause:

    #pragma omp parallel private(i,j,k)
    #pragma omp parallel for collapse(3) shared(A)
    for(k=0;k<100;k++)
     for(i=1;i<1024;i++)
      for(j=0;j<1024;j++)
       A[i][j+1]=<< some expression involving elements of A[][] >>

Another thing I tried was putting a #pragma omp parallel for before each loop instead of using collapse().

I think the issue is the data dependency. Any idea how to parallelize a loop nest that has such a dependency?


2 Answers

0 votes

If this is really your use case, just parallelize the outer loop, k; that should largely suffice for the modest parallelism you can get on common architectures.
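
For example, here is a minimal sketch of that suggestion. some_expression and the array shape are hypothetical stand-ins for the parts the question elides, and this is only valid if the iterations of k really are independent of each other; if step k reads values written in step k-1, the k loop has to stay sequential:

    /* hypothetical stand-in for the elided update expression */
    double some_expression(double A[1024][1025], int i, int j);

    void compute(double A[1024][1025])   /* j + 1 reaches 1024, so A needs 1025 columns */
    {
    /* one work-sharing directive on the outermost loop only;
       k, i, j are declared in the loops, so they are private per thread */
    #pragma omp parallel for
        for (int k = 0; k < 100; k++)
            for (int i = 1; i < 1024; i++)
                for (int j = 0; j < 1024; j++)
                    A[i][j + 1] = some_expression(A, i, j);
    }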

If you want more, you'd have to rewrite your loops so that you have an inner part without the dependency. In your example case this is relatively easy: process the matrix by "diagonals" (outer loop, sequential); inside each diagonal the points are independent of one another.

    for (size_t d = 0; d < nDiag(100); ++d) {    // the diagonals themselves stay sequential
       size_t nPoints = somefunction(d);          // number of points on diagonal d
    #pragma omp parallel for
       for (size_t p = 0; p < nPoints; ++p) {     // points on one diagonal are independent
          size_t x = coX(p, d);
          size_t y = coY(p, d);
          ... your real code ...
       }
    }

Part of this could be done automatically, but I don't think such tools are readily available in everyday OpenMP implementations yet. This is an active line of research.
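
For a concrete picture, here is a hedged, self-contained sketch of such a diagonal ("wavefront") sweep. Since the question elides the real update, it assumes, purely for illustration, a Gauss-Seidel-style stencil in which A[i][j] depends on A[i-1][j] and A[i][j-1]; cells on the same anti-diagonal (i + j constant) then do not depend on each other and can be updated in parallel:

    #include <stddef.h>

    #define N 1024

    void diagonal_sweep(double A[N][N])
    {
        /* the anti-diagonals themselves must be processed sequentially */
        for (size_t d = 2; d <= 2 * (N - 1); ++d) {
            size_t iLo = (d < N) ? 1 : d - (N - 1);   /* first row touching diagonal d */
            size_t iHi = (d < N) ? d - 1 : N - 1;     /* last row touching diagonal d  */
    #pragma omp parallel for
            for (size_t i = iLo; i <= iHi; ++i) {
                size_t j = d - i;                     /* column follows from the row   */
                /* illustrative update only -- replace with the real expression */
                A[i][j] = 0.25 * (A[i - 1][j] + A[i][j - 1]);
            }
        }
    }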

Also note the following

  • int is rarely a good idea for indices, in particular when you access matrices. If you have to compute the absolute position of an entry yourself (and it looks like you might have to here), this overflows easily: int is usually 32 bits wide, and of those 32 you are even wasting one on the sign. In C, object sizes are measured in size_t, which is usually 64 bits wide and in any case the correct type chosen by your platform designer.
  • Use local variables for loop indices and other temporaries; as you can see, writing OMP pragmas becomes much easier then. Locality is one key to parallelism: help yourself and the compiler by expressing it correctly. (A short sketch illustrating both points follows this list.)
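
As a minimal illustration of both notes (the function and its parameters are hypothetical): the matrix below is stored flat, so the absolute position i * cols + j is computed explicitly, which overflows easily with a 32-bit int but not with size_t. And because the indices are declared inside the for statements, they are automatically private to each thread, so no private() clause is needed.

    #include <stddef.h>

    /* scale every entry of a flat rows x cols matrix */
    void scale_matrix(double *A, size_t rows, size_t cols, double factor)
    {
    #pragma omp parallel for
        for (size_t i = 0; i < rows; ++i)          /* i, j are local, hence private */
            for (size_t j = 0; j < cols; ++j)
                A[i * cols + j] *= factor;         /* flat index computed in size_t */
    }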
0 votes

You're only parallelizing the outer 'k' for loop. Every parallel thread is executing the 'i' and 'j' loops, and they're all writing into the same 'A' result. Since they're all reading and writing the same slots in A, the final result will be non-deterministic.

It's not clear from your problem that any parallelism is possible, since each step seems to depend on every previous step.