0
votes

I have the following code:

for (int i = 0; i < veryLargeArraySize; i++){
   int value = A[i];
   if (B[value] < MAX_VALUE) {
     B[value]++;
   }
 }

I want to use OpenMP worksharing construct here, but my issue is the synchronization on the B array - all parallel threads can access any element of array B, which is very large (which made use of locks difficult since I'd need too many of them)

#pragma omp critical is a serious overhead here. Atomic is not possible, because of the if.

Does anyone have a good suggestion on how I might do this?

1
How large is B compared to A?Zulan
@Zulan They both exceed 1 million elements, with A being slightly larger (it is a histogram calculation of an image - A is the pixelated image, B is the histogram)Petros17
I would assume that a histogram has significantly less bins than the number of original datapoints. Is the histogram very sparse?Zulan
Why truncate the count (B) inside the loop? Why not simply unconditionally add one, and then, in a separate phase loop over B and truncate the values. Then you can use an atomic.Jim Cownie
Possible duplicate of Calculate the histogram with OpenMPZulan

1 Answers

0
votes

Here's what I've found out and done.

I've read on some forums that parallel histogram calculation is generally a bad idea, since it may be slower and less efficient than the sequential calculation.

However, I needed to do it (for the assignment), so what I did is the following:

  1. Parallel processing of the A array(the image) to determine the actual range of values (the histogram - B array) - find MIN and MAX of A[i]

    int min_value, max_value;
    #pragma omp for reduction(min:min_value), reduction(max:max_value)
    for (i = 0; i < veryLargeArraySize; i++){
         const unsigned int value = A[i];
         if(max_value < value) max_value = value;
         if(min_value > value) min_value = value;
    }
    
    int size_of_histo = max_value - min_value + 1;`
    
  2. That way, we can (potentially) reduce the actual histogram size from, e.g., 1M elements (allocated in array B) to 50K elements (allocated in sharedHisto)
  3. Allocate a shared array, such as:

    int num_threads = omp_get_num_threads();
    int* sharedHisto = (int*) calloc(num_threads * size_of_histo, sizeof(int));
    
  4. Each thread is assigned a part of the sharedHisto, and can update it without synchronization

    int my_id = omp_get_thread_num();
    #pragma omp parallel for default(shared) private(i)
    for(i = 0; i < veryLargeArraySize; i++){
        int value = A[i];
        // my_id * size_of_histo positions to the begining of this thread's
        // part of sharedHisto .
        // i - min_value positions to the actual histo value
        sharedHisto[my_id * size_of_histo + i - min_value]++;
    }
    
  5. Now, perform a reduction (as stated here: Reducing on array in OpenMp)

    #pragma omp parallel
    {
       // Every thread is in charge for part of the reduced histogram
       // shared_histo with the size: size_of_histo
       int my_id = omp_get_thread_num();
       int num_threads = omp_get_num_threads();
       int chunk = (size_of_histo + num_threads - 1) / num_threads;
       int start = my_id * chunk;
       int end = (start + chunk > histo_dim) ? histo_dim : start + chunk;
    
       #pragma omp for default(shared) private(i, j)
       for(i = start; i < end; i++){
           for(j = 0; j < num_threads; j++){
               int value = B[i + minHistoValue] + sharedHisto[j * size_of_histo + i];
    
               if(value > MAX_VALUE) B[i + min_value] = MAX_VALUE;
               else B[i + min_value] = value;
           }
        }
     }