reduction with OpenMP with SSE/AVX

Question

I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to:

inline float sum_scalar_openmp2(const float a[], const size_t N) {
    float sum = 0.0f;
    #pragma omp parallel
    {
        float sum_private = 0.0f;
        #pragma omp parallel for nowait
        for(int i=0; i<N; i++) {
            sum_private += a[i];
        }
        #pragma omp atomic
        sum += sum_private;
    }
    return sum;
}

I got this idea from the follow link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause But atomic also does not support complex operators. What I did was replace atomic with critical and implemented the reduction with OpenMP and SSE like this:

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
inline float sum_vector4_openmp(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    #pragma omp parallel 
    {
        __m128 sum4_private = _mm_set1_ps(0.0f);
        #pragma omp for nowait
        for(int i=0; i < ROUND_DOWN(N, 4); i+=4) {
            __m128 a4 = _mm_load_ps(a + i);
            sum4_private = _mm_add_ps(a4, sum4_private);
        }
        #pragma omp critical
        sum4 = _mm_add_ps(sum4_private, sum4);
    }
    __m128 t1 = _mm_hadd_ps(sum4,sum4);
    __m128 t2 = _mm_hadd_ps(t1,t1);
    float sum = _mm_cvtss_f32(t2);  
    for(int i = ROUND_DOWN(N, 4); i < N; i++) {
        sum += a[i];
    }
    return sum;
}

However, this function does not perform as well as I hope. I'm using Visual Studio 2012 Express. I know I can improve the performance a bit by unrolling the SSE load/add a few times but that still is less than I expect.

I get much better performance by running over slices of the arrays equal to the number of threads:

inline float sum_slice(const float a[], const size_t N) {
    int nthreads = 4;
    const int offset = ROUND_DOWN(N/nthreads, nthreads);
    float suma[8] = {0};
    #pragma omp parallel for num_threads(nthreads) 
    for(int i=0; i<nthreads; i++) {
        suma[i] = sum_vector4(&a[i*offset], offset);
    }
    float sum = 0.0f;
    for(int i=0; i<nthreads; i++) {
        sum += suma[i]; 
    }
    for(int i=nthreads*offset; i < N; i++) {
        sum += a[i];
    }
    return sum;    
}

inline float sum_vector4(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    int i = 0;
    for(; i < ROUND_DOWN(N, 4); i+=4) {
        __m128 a4 = _mm_load_ps(a + i);
        sum4 = _mm_add_ps(sum4, a4);
    }
    __m128 t1 = _mm_hadd_ps(sum4,sum4);
    __m128 t2 = _mm_hadd_ps(t1,t1);
    float sum = _mm_cvtss_f32(t2);
    for(; i < N; i++) {
        sum += a[i];
    }
    return sum;

}

Does someone know if there is a better way of doing reductions with more complicated operators in OpenMP?

I do the tests on the cache sizes: N(L1) = 32k, N(L2) = 256K, N(L3/4) = 2M, and N>>N(L3). — user2088790

veda veda · Accepted Answer · 2013-03-16T04:51:20

I guess the answer to your question is No. I don't think there is a better way of doing reduction with more complicated operators in OpenMP.

Assuming that the array is 16 bit aligned, number of openmp threads is 4, one might expect the performance gain to be 12x - 16x by OpenMP + SIMD. In realistic, it might not produce enough performance gain because

There is a overhead in creating the openmp threads.
The code is doing 1 addition operation for 1 Load operation. Hence, the CPU isn't doing enough computation. So, it almost looks like the CPU spends most of the time in loading the data, kind of memory bandwidth bound.

reduction with OpenMP with SSE/AVX

1 Answers