0
votes

I'm trying to optimise the following loop with OpenMP:

    #pragma omp parallel for private(diff)
    for (int j = 0; j < x.d; ++j) {
        diff = x(example,j) - x(chosen_pts[ndx - 1],j);
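        // every iteration does an atomic read-modify-write on the shared d2,
        // which serializes the threads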
        #pragma omp atomic
        d2 += diff * diff;
    }

But it actually runs 4x slower than without the #pragma.

EDIT

As Piotr S., coincoin, and erenon pointed out, in my case x.d is so small that the parallelisation overhead makes the code run slower. I'm posting the outer loop too; maybe there is some opportunity for multithreading (x.n is over 100 million):

    float sum_distribution = 0.0;
    // look for the point that is furthest from any center
    float max_dist = 0.0;

    for (int i = 0; i < x.n; ++i) {
        int example = dist2[i].second;
        float d2 = 0.0, diff;
        //#pragma omp parallel for private(diff) reduction(+:d2)
        for (int j = 0; j < x.d; ++j) {
            diff = x(example,j) - x(chosen_pts[ndx - 1],j);
            d2 += diff * diff;
        }
        if (d2 < dist2[i].first) {
            dist2[i].first = d2;
        }

        if (dist2[i].first > max_dist) {
            max_dist = dist2[i].first;
        }

        sum_distribution += dist2[i].first;
    }

If anyone is interested, here is the whole function: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L169, but as I measured it, 85% of the elapsed time comes from this loop.

What is the value of x.d? - erenon
Looks like d2 has become a bottleneck here: all threads need to access the same memory. It would be faster to let each thread keep track of its own sum and add these thread-specific sums together when the loop is done. I think you can do this by adding reduction(+:d2) to the first pragma. - brm
@erenon x.d = 3 in my test - user1930254
@brm I removed the atomic and changed the first pragma to '#pragma omp parallel for private(diff) reduction(+:d2)', but I get the same result. - user1930254
@user1930254 creating threads takes time as well, and thread synchronization isn't free either; is it worth doing this to execute 3 iterations? - Piotr Skotnicki

1 Answer

1
votes

Yes, the outer loop, as posted, can be parallelized with OpenMP. All variables modified in the loop are either local to an iteration or are used for aggregation over the loop, and I assume that the calls to x() in the calculation of diff have no side effects.

To do aggregation in parallel correctly and efficiently, you need to use an OpenMP loop with reduction clauses. For sum_distribution the reduction operation is +, and for max_dist it's max. So, adding the following pragma in front of the outer loop should do the job:

    #pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
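
Applied to the loop from the question, it would look like this (a sketch, assuming x, dist2, chosen_pts, and ndx are declared as in your code, and that concurrent reads through x() are safe):

    #pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
    for (int i = 0; i < x.n; ++i) {
        int example = dist2[i].second;
        float d2 = 0.0, diff;
        // the inner loop stays serial: x.d is only 3
        for (int j = 0; j < x.d; ++j) {
            diff = x(example,j) - x(chosen_pts[ndx - 1],j);
            d2 += diff * diff;
        }
        if (d2 < dist2[i].first) {
            dist2[i].first = d2; // each iteration writes its own element, no race
        }
        if (dist2[i].first > max_dist) {
            max_dist = dist2[i].first; // updates a private copy, merged after the loop
        }
        sum_distribution += dist2[i].first; // likewise combined by the + reduction
    }

Variables declared inside the loop body (example, d2, diff) are automatically private to each iteration, so no private clause is needed.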

Note that max as a reduction operation can only be used since OpenMP 3.1. It's not that new, so most OpenMP-enabled compilers already support it, but not all of them, or you might be using an older version. So it makes sense to consult the documentation for your compiler.
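
If yours doesn't support it, a possible workaround, sketched below along the lines of the per-thread sums suggested in the comments, is to keep a thread-local maximum and merge it in a critical section once per thread:

    #pragma omp parallel
    {
        float local_max = 0.0;
        #pragma omp for reduction(+:sum_distribution)
        for (int i = 0; i < x.n; ++i) {
            int example = dist2[i].second;
            float d2 = 0.0, diff;
            for (int j = 0; j < x.d; ++j) {
                diff = x(example,j) - x(chosen_pts[ndx - 1],j);
                d2 += diff * diff;
            }
            if (d2 < dist2[i].first) {
                dist2[i].first = d2;
            }
            if (dist2[i].first > local_max) {
                local_max = dist2[i].first; // per-thread maximum, no sharing
            }
            sum_distribution += dist2[i].first;
        }
        #pragma omp critical
        {
            if (local_max > max_dist) { // each thread merges its maximum exactly once
                max_dist = local_max;
            }
        }
    }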