I am summing two arrays and outputting the result to a third array (not a reduction), like this:
void add_scalar(float* result, const float* a, const float* b, const int N) {
    for(int i = 0; i < N; i++) {
        result[i] = a[i] + b[i];
    }
}
I want to do this with the maximum throughput. With SSE and four cores I naively expect a 16x speedup (4x for SSE and 4x for the four cores). I have implemented the code with SSE (and AVX). Visual Studio 2012 has auto-vectorization, but I get better results by unrolling the loop.

I run my code for arrays (with 32-byte alignment) of four sizes: less than 32 KB, less than 256 KB, less than 8 MB, and greater than 8 MB, corresponding to the L1, L2, and L3 caches and main memory. For L1 I see about a 4x speedup using my unrolled SSE code (5-6x with AVX). That's as much as I expect. The efficiency drops at each cache level after that.

Then I use OpenMP to run on each core by putting "#pragma omp parallel for" before my main loop over the array. However, the best speedup I get is 5-6x with SSE + OpenMP. Does anyone have a clue why I'm not seeing a 16x speedup? Maybe it's due to the time it takes to "upload" the arrays from system memory into the caches? I realize I should profile the code, but that's another adventure in itself that I have to learn.
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
void add_vector(float* result, const float* a, const float* b, const int N) {
    int i = 0;
    // Unrolled by two: each iteration handles 8 floats in two SSE registers.
    for(; i < ROUND_DOWN(N, 8); i += 8) {
        __m128 a4 = _mm_load_ps(a + i);
        __m128 b4 = _mm_load_ps(b + i);
        _mm_store_ps(result + i, _mm_add_ps(a4, b4));
        a4 = _mm_load_ps(a + i + 4);
        b4 = _mm_load_ps(b + i + 4);
        _mm_store_ps(result + i + 4, _mm_add_ps(a4, b4));
    }
    // Scalar tail for the remaining N % 8 elements.
    for(; i < N; i++) {
        result[i] = a[i] + b[i];
    }
}
My wrong main loop, which has a race condition, looked something like this:
float *a = (float*)_aligned_malloc(N*sizeof(float), 32);
float *b = (float*)_aligned_malloc(N*sizeof(float), 32);
float *c = (float*)_aligned_malloc(N*sizeof(float), 32);

#pragma omp parallel for
for(int n = 0; n < M; n++) { // M is the number of times to run over the array
    add_vector(c, a, b, N);
}
My corrected main loop based on Grizzly's suggestions:
float* results[4];
for(int i = 0; i < 4; i++) {
    results[i] = (float*)_aligned_malloc(N*sizeof(float), 32);
}

#pragma omp parallel for num_threads(4)
for(int t = 0; t < 4; t++) {
    for(int n = 0; n < M/4; n++) { // M is the number of times to run over the array
        add_vector(results[t], a, b, N);
    }
}