I'm looking for some advice on how to do a parallel prefix sum with SSE. I'm interested in doing this on an array of ints, floats, or doubles.
I have come up with two solutions: a special case and a general case. In both cases the solution runs over the array in two passes, in parallel with OpenMP. For the special case I use SSE on both passes. For the general case I use it only on the second pass.
My main question is: how can I use SSE on the first pass in the general case? The following link, simd-prefix-sum-on-intel-cpu, shows an improvement for bytes but not for 32-bit data types.
The reason the special case is called special is that it requires the array to be in a special format. For example, let's assume there are only 16 elements in an array of floats. Then, if the array is rearranged like this (array of structs to struct of arrays):
a[0] a[1] ...a[15] -> a[0] a[4] a[8] a[12] a[1] a[5] a[9] a[13]...a[3] a[7] a[11] a[15]
SSE vertical sums could be used on both passes. However, this would only be efficient if the array was already in the special format and the output could be used in the special format. Otherwise an expensive rearrangement would have to be done on both the input and the output, which would make it much slower than the general case.
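To make the vertical-sum idea concrete, here is a minimal sketch of the first pass under that layout (prefix_pass1_soa and t are my names; it assumes n is a multiple of 4 and that t is 16-byte aligned):

    #include <xmmintrin.h> // SSE

    // t is the rearranged buffer: t[4*i+k] = a[k*(n/4) + i], so each vector
    // load picks up one element from each of the four contiguous quarters.
    void prefix_pass1_soa(float *t, int n) {
        __m128 running = _mm_setzero_ps();
        for (int i = 0; i < n/4; i++) {
            __m128 x = _mm_load_ps(&t[4*i]);
            running = _mm_add_ps(running, x); // vertical running sum, one per lane
            _mm_store_ps(&t[4*i], running);   // lane k now holds quarter k's prefix
        }
        // Afterwards lane k of running holds the total of quarter k; the second
        // pass adds the totals of quarters 0..k-1 to quarter k.
    }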
Maybe I should consider a different algorithm for the prefix sum (e.g. a binary tree)?
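For what it's worth, the in-register approach from that link, adapted from bytes to floats, is a log-step shift-and-add over one register (a sketch; scan_4 is my name, and whether it beats the scalar first pass for 32-bit types is exactly the open question):

    #include <emmintrin.h> // SSE2, for the 128-bit byte shifts

    // Inclusive prefix sum of the 4 floats in one register:
    // after the two steps, lane i holds x[0] + ... + x[i].
    static inline __m128 scan_4(__m128 x) {
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
        return x;
    }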
Code for the general case:
    #include <omp.h>
    #include <emmintrin.h> // SSE2

    void prefix_sum_omp_sse(double a[], double s[], int n) {
        double *suma;
    #pragma omp parallel
        {
            const int ithread = omp_get_thread_num();
            const int nthreads = omp_get_num_threads();
    #pragma omp single
            {
                suma = new double[nthreads + 1];
                suma[0] = 0;
            }
            double sum = 0;
    #pragma omp for schedule(static) nowait // first pass: per-thread scalar prefix sums
            for (int i = 0; i < n; i++) {
                sum += a[i];
                s[i] = sum;
            }
            suma[ithread + 1] = sum; // record this thread's total
    #pragma omp barrier
    #pragma omp single
            {
                // exclusive scan over the per-thread totals: suma[t] becomes
                // the offset that thread t must add to its chunk
                double tmp = 0;
                for (int i = 0; i < (nthreads + 1); i++) {
                    tmp += suma[i];
                    suma[i] = tmp;
                }
            }
            __m128d offset = _mm_set1_pd(suma[ithread]);
            // NOTE: this assumes each thread's chunk from the first loop starts on
            // a 4-element boundary (e.g. n a multiple of 4*nthreads); otherwise the
            // two static schedules do not line up and the wrong offset is applied.
    #pragma omp for schedule(static) // second pass: add the offsets with SSE
            for (int i = 0; i < n/4; i++) {
                __m128d tmp1 = _mm_loadu_pd(&s[4*i]);   // unaligned loads/stores:
                tmp1 = _mm_add_pd(tmp1, offset);        // s is in the general case
                __m128d tmp2 = _mm_loadu_pd(&s[4*i+2]); // only 8-byte aligned
                tmp2 = _mm_add_pd(tmp2, offset);
                _mm_storeu_pd(&s[4*i], tmp1);
                _mm_storeu_pd(&s[4*i+2], tmp2);
            }
    #pragma omp single
            for (int i = 4*(n/4); i < n; i++) // scalar cleanup for the tail when
                s[i] += suma[nthreads - 1];   // n is not a multiple of 4
        }
        delete[] suma;
    }
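A hypothetical usage sketch (main and the test values are mine); the function needs OpenMP and SSE2 enabled at compile time, e.g. g++ -O2 -fopenmp -msse2:

    #include <cstdio>

    int main() {
        const int n = 16;
        double a[n], s[n];
        for (int i = 0; i < n; i++) a[i] = 1.0; // prefix sums should be 1, 2, ..., 16
        prefix_sum_omp_sse(a, s, n);
        printf("%g %g\n", s[0], s[n - 1]); // expect: 1 16
        return 0;
    }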
Comments:

The time should go as n/ncores*(1+1/SIMD_width). So for 4 cores and SIMD_width = 2 (double) that should be 3n/8, which is about a 2.7 times speedup (n divided by 3n/8 is 8/3). Hyper-threading helps a bit, so I guess that's pushing it over 3 (I'm using 8 threads; when I try 4 threads the performance drops a bit). – Z boson

_mm_load_ps requires 16-byte alignment, but a float * will in the general case only be 4-byte aligned. – BeeOnRope