
I'm looking for some advice on how to do a parallel prefix sum with SSE. I'm interested in doing this on an array of ints, floats, or doubles.

I have come up with two solutions: a special case and a general case. In both cases the solution runs over the array in two passes, in parallel with OpenMP. For the special case I use SSE on both passes. For the general case I use it only on the second pass.

My main question is how I can use SSE on the first pass in the general case. The following link, simd-prefix-sum-on-intel-cpu, shows an improvement for bytes but not for 32-bit data types.

The reason the special case is called special is that it requires the array to be in a special format. For example, let's assume there are only 16 elements of an array a of floats. Then, if the array is rearranged like this (array of structs to struct of arrays):

a[0] a[1] ...a[15] -> a[0] a[4] a[8] a[12] a[1] a[5] a[9] a[13]...a[3] a[7] a[11] a[15]

SSE vertical sums could be used on both passes. However, this would only be efficient if the array were already in the special format and the output could be used in the special format. Otherwise an expensive rearrangement would have to be done on both the input and the output, which would make it much slower than the general case.
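To make the special case concrete, here is a rough single-threaded sketch of the idea (not my actual code): it assumes the data is already in the rearranged layout, n is a multiple of 4, and the pointers are 16-byte aligned; the function name is just for illustration. With OpenMP each thread would run the same two passes on its own chunk.

#include <xmmintrin.h>  // SSE

// Input t and output s are both in the rearranged layout:
// t[4*q + r] = a[(n/4)*r + q], i.e. lane r tracks block r of the original array.
void prefix_sum_soa_sse(const float *t, float *s, int n) {
    const int m = n / 4;               // elements per block
    __m128 v = _mm_setzero_ps();

    // Pass 1: one vertical add per group of four gives the running prefix
    // sums of all four blocks at once.
    for (int q = 0; q < m; q++) {
        v = _mm_add_ps(v, _mm_load_ps(&t[4 * q]));
        _mm_store_ps(&s[4 * q], v);
    }

    // The block totals are the four lanes of v; build the exclusive offsets
    // [0, T0, T0+T1, T0+T1+T2].
    alignas(16) float total[4];
    _mm_store_ps(total, v);
    __m128 offset = _mm_set_ps(total[0] + total[1] + total[2],
                               total[0] + total[1],
                               total[0],
                               0.0f);

    // Pass 2: vertical add of the per-block offsets; the result stays in the
    // rearranged layout.
    for (int q = 0; q < m; q++) {
        _mm_store_ps(&s[4 * q], _mm_add_ps(_mm_load_ps(&s[4 * q]), offset));
    }
}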

Maybe I should consider a different algorithm for the prefix sum (e.g. a binary tree)?

Code for the general case:

#include <omp.h>
#include <emmintrin.h>  // SSE2 intrinsics (__m128d)

// Note: the aligned loads/stores require a and s to be 16-byte aligned, and
// correctness relies on both static schedules assigning the same element
// ranges per thread (e.g. when n is a multiple of 4*nthreads).
void prefix_sum_omp_sse(double a[], double s[], int n) {
    double *suma;
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        #pragma omp single
        {
            suma = new double[nthreads + 1];
            suma[0] = 0;
        }
        double sum = 0;
        #pragma omp for schedule(static) nowait //first parallel pass
        for (int i = 0; i<n; i++) {
            sum += a[i];
            s[i] = sum;
        }
        suma[ithread + 1] = sum;
        #pragma omp barrier
        #pragma omp single
        {
            double tmp = 0;
            for (int i = 0; i<(nthreads + 1); i++) {
                tmp += suma[i];
                suma[i] = tmp;
            }
        }
        __m128d offset = _mm_set1_pd(suma[ithread]);
        #pragma omp for schedule(static) //second parallel pass with SSE as well
        for (int i = 0; i<n/4; i++) {       
            __m128d tmp1 = _mm_load_pd(&s[4*i]);
            tmp1 = _mm_add_pd(tmp1, offset);    
            __m128d tmp2 = _mm_load_pd(&s[4*i+2]);
            tmp2 = _mm_add_pd(tmp2, offset);
            _mm_store_pd(&s[4*i], tmp1);
            _mm_store_pd(&s[4*i+2], tmp2);
        }
    }
    delete[] suma;
}
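For reference, a minimal driver sketch (hypothetical, assuming prefix_sum_omp_sse above is in the same file): it allocates 16-byte aligned buffers with _mm_malloc because the second pass uses aligned loads/stores, and n is chosen as a multiple of 4*nthreads (true here for the usual power-of-two thread counts) so that both static schedules assign the same element ranges to each thread.

#include <emmintrin.h>  // also pulls in _mm_malloc/_mm_free on gcc/icc (MSVC: <malloc.h>)
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double *a = (double*)_mm_malloc(n * sizeof(double), 16);
    double *s = (double*)_mm_malloc(n * sizeof(double), 16);
    for (int i = 0; i < n; i++) a[i] = 1.0;

    prefix_sum_omp_sse(a, s, n);
    printf("s[0] = %f, s[n-1] = %f (expect 1 and %d)\n", s[0], s[n - 1], n);

    _mm_free(a);
    _mm_free(s);
    return 0;
}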
I thought compilers like gcc/icc could do auto-vectorization for the second part, so that you don't need to use SIMD intrinsics. Do you get a performance improvement compared to plain C code with compiler options like -msse2? – kangshiyin
They might. I ran this on MSVC 2013. It does not auto-vectorize the second pass. My experience with MSVC is that when you use OpenMP you have to do the vectorization yourself. I don't think any of them will unroll the loop with the SSE code for you, but it does not help in this case anyway. – Z boson
In response to the question on performance, the general code I posted is over 3 times faster than the sequential code in release mode with AVX enabled on my 4-core Ivy Bridge system. The time cost should be n/ncores*(1 + 1/SIMD_width). So for 4 cores and SIMD_width = 2 (double) that's 3n/8, about a 2.7x speedup. Hyper-threading helps a bit, so I guess that's pushing it over 3 (I'm using 8 threads; when I try 4 threads the performance drops a bit). – Z boson
You might want to mention that the input and output arrays need to be 16-byte aligned due to the use of _mm_load_ps, but a float * will in the general case only be 4-byte aligned. – BeeOnRope

1 Answer


This is the first time I'm answering my own question, but it seems appropriate. Based on hirschhornsalz's answer for a prefix sum on 16 bytes (simd-prefix-sum-on-intel-cpu), I have come up with a solution for using SIMD on the first pass for 4, 8, and 16 32-bit words.

The general theory goes as follows. A sequential scan of n words takes n additions (n-1 to scan the n words and one more to carry in the sum from the previous set of words scanned). However, using SIMD, n words can be scanned in log2(n) additions and an equal number of shifts, plus one more addition and a broadcast to carry in the sum from the previous SIMD scan. So for some value of n the SIMD method will win.

Let's look at 32-bit words with SSE, AVX, and AVX-512:

4 32-bit words (SSE):      2 shifts, 3 adds, 1 broadcast       sequential: 4 adds
8 32-bit words (AVX):      3 shifts, 4 adds, 1 broadcast       sequential: 8 adds
16 32-bit words (AVX-512): 4 shifts, 5 adds, 1 broadcast       sequential: 16 adds

Based on that, it appears SIMD won't be useful for a scan of 32-bit words until AVX-512. This also assumes that the shifts and the broadcast can each be done in only one instruction. That is true for SSE but not for AVX, and maybe not even for AVX2.

In any case I put together some working and tested code which does a prefix sum using SSE.

inline __m128 scan_SSE(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4))); 
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

void prefix_sum_SSE(float *a, float *s, const int n) {
    __m128 offset = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(&a[i]);
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        _mm_store_ps(&s[i], out);
        // broadcast the last element as the carry for the next 4 elements
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
    }
}

Notice that the scan_SSE function has two additions (_mm_add_ps) and two shifts (_mm_slli_si128). The casts are only there to make the compiler happy and don't get converted into instructions. Then, inside the main loop over the array in prefix_sum_SSE, another addition and one shuffle are used. That's 6 operations in total, compared to only 4 additions with the sequential sum.

Here is a working solution for AVX:

inline __m256 scan_AVX(__m256 x) {
    __m256 t0, t1;
    //shift1_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x11));
    //shift2_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x33));
    //shift3_AVX + add
    x = _mm256_add_ps(x,_mm256_permute2f128_ps(x, x, 41));
    return x;
}

void prefix_sum_AVX(float *a, float *s, const int n) {
    __m256 offset = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 x = _mm256_loadu_ps(&a[i]);
        __m256 out = scan_AVX(x);
        out = _mm256_add_ps(out, offset);
        _mm256_storeu_ps(&s[i], out);
        //broadcast last element
        __m256 t0 = _mm256_permute2f128_ps(out, out, 0x11);
        offset = _mm256_permute_ps(t0, 0xff);
    }   
}

The three shifts need 7 intrinsics and the broadcast needs 2 intrinsics, so with the 4 additions that's 13 intrinsics. For AVX2 only 5 intrinsics are needed for the shifts, so 11 intrinsics in total. The sequential sum only needs 8 additions. Therefore it's likely that neither AVX nor AVX2 will be useful for the first pass.
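For what it's worth, here is a possible (untested) AVX2 sketch of the scan: the per-lane byte shift _mm256_slli_si256 covers the first two shift+add steps within each 128-bit lane, and a single cross-lane permute plus an in-lane broadcast carries the low lane's total into the high lane.

#include <immintrin.h>

inline __m256 scan_AVX2(__m256 x) {
    __m256 t;
    // shift by one float within each 128-bit lane, then add
    t = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
    x = _mm256_add_ps(x, t);
    // shift by two floats within each 128-bit lane, then add
    t = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 8));
    x = _mm256_add_ps(x, t);
    // each lane now holds its own 4-element prefix sum; move the low lane up
    // (low half zeroed), broadcast its last element, and add it as the carry
    t = _mm256_permute2f128_ps(x, x, 0x08);
    t = _mm256_permute_ps(t, 0xff);
    return _mm256_add_ps(x, t);
}

The driver loop would be the same as prefix_sum_AVX above with scan_AVX2 substituted. That is 2 shifts, 2 permutes, and 3 adds for the scan itself, so with the offset addition and the two-intrinsic broadcast it lands close to the 11 intrinsics counted above.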

Edit:

So I finally benchmarked this, and the results were unexpected: the SSE and AVX code are both about twice as fast as the following sequential code:

void scan(float a[], float s[], int n) {
    float sum = 0;
    for (int i = 0; i<n; i++) {
        sum += a[i];
        s[i] = sum;
    }
}
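A minimal timing skeleton for reproducing the comparison (a rough sketch, not the exact benchmark; it assumes 16-byte aligned buffers, n a multiple of 8, and that the functions above are in scope):

#include <omp.h>        // omp_get_wtime
#include <emmintrin.h>  // _mm_malloc / _mm_free
#include <cstdio>

int main() {
    const int n = 1 << 19;   // 512K floats
    const int reps = 100;
    float *a = (float*)_mm_malloc(n * sizeof(float), 32);
    float *s = (float*)_mm_malloc(n * sizeof(float), 32);
    for (int i = 0; i < n; i++) a[i] = 1.0f;

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) scan(a, s, n);
    double t_scalar = (omp_get_wtime() - t0) / reps;

    t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) prefix_sum_SSE(a, s, n);
    double t_sse = (omp_get_wtime() - t0) / reps;

    printf("scalar: %g s  SSE: %g s  ratio: %.2f\n", t_scalar, t_sse, t_scalar / t_sse);
    _mm_free(a);
    _mm_free(s);
    return 0;
}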

I guess this is due to instruction-level parallelism: the sequential loop has a loop-carried dependency through sum, so it can only advance one element per floating-point add latency, while the SIMD versions' carried dependency (through offset) advances 4 (or 8) elements per vector addition.

So that answers my own question: I succeeded in using SIMD for pass 1 in the general case. When I combine this with OpenMP on my 4-core Ivy Bridge system, the total speedup is about 7x for 512k floats.