I am trying to optimise the following loop with OpenMP:
#pragma omp parallel for private(diff)
for (int j = 0; j < x.d; ++j) {
    diff = x(example,j) - x(chosen_pts[ndx - 1],j);
    #pragma omp atomic
    d2 += diff * diff;
}
But it actually runs 4x slower than without the #pragma.
EDIT
As Piotr S., coincoin, and erenon pointed out, in my case x.d is so small that the parallelisation overhead outweighs any gain, which is why the code runs slower. I am posting the outer loop too; maybe there is some opportunity for multithreading there (x.n is over 100 million):
float sum_distribution = 0.0;
// look for the point that is furthest from any center
float max_dist = 0.0;
for (int i = 0; i < x.n; ++i) {
    int example = dist2[i].second;
    float d2 = 0.0, diff;
    //#pragma omp parallel for private(diff) reduction(+:d2)
    for (int j = 0; j < x.d; ++j) {
        diff = x(example,j) - x(chosen_pts[ndx - 1],j);
        d2 += diff * diff;
    }
    if (d2 < dist2[i].first) {
        dist2[i].first = d2;
    }
    if (dist2[i].first > max_dist) {
        max_dist = dist2[i].first;
    }
    sum_distribution += dist2[i].first;
}
If anyone is interested, here is the whole function: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L169. As I measured it, 85% of the elapsed time comes from this loop.
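One option is to move the parallelism to the outer loop over x.n and leave the short inner loop serial. Below is a minimal sketch of that, assuming OpenMP 3.1 or later (for the max reduction). Each iteration writes only dist2[i], so there are no races on dist2; note that the reduction reassociates the floating-point sum, so sum_distribution may differ in the last bits from the serial result.

float sum_distribution = 0.0;
// look for the point that is furthest from any center
float max_dist = 0.0;
#pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
for (int i = 0; i < x.n; ++i) {
    int example = dist2[i].second;
    float d2 = 0.0, diff;
    // x.d is small, so the inner loop stays serial
    for (int j = 0; j < x.d; ++j) {
        diff = x(example,j) - x(chosen_pts[ndx - 1],j);
        d2 += diff * diff;
    }
    if (d2 < dist2[i].first) {
        dist2[i].first = d2;
    }
    if (dist2[i].first > max_dist) {
        max_dist = dist2[i].first;
    }
    sum_distribution += dist2[i].first;
}

With x.n over 100 million, each thread gets plenty of work, so the one-time cost of starting the thread team is amortised, unlike in the inner-loop version.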
d2 has become a bottleneck here: all threads need to access the same memory. It would be faster to let each thread keep track of its own sum, adding these thread-specific sums together when the loop is done. I think you can do this by adding a reduction(+:d2) to the first pragma - brm
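For reference, brm's suggestion applied to the first snippet would look like the sketch below; note the correct OpenMP syntax is reduction(+:d2), with a colon:

float d2 = 0.0, diff;
#pragma omp parallel for private(diff) reduction(+:d2)
for (int j = 0; j < x.d; ++j) {
    diff = x(example,j) - x(chosen_pts[ndx - 1],j);
    // each thread accumulates into its own private copy of d2; the copies
    // are summed once after the loop, instead of one atomic per iteration
    d2 += diff * diff;
}

As the EDIT notes, though, with such a small x.d the overhead of setting up the parallel region still dominates, so parallelising the outer loop is the more promising route.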