2 votes

I am summing two arrays and writing the result to a third array (not a reduction). Like this:

void add_scalar(float* result, const float* a, const float* b, const int N) {   
    for(int i = 0; i<N; i++) {
        result[i] = a[i] + b[i];
    }
}

I want to do this with the maximum throughput. With SSE and four cores I naively expect a speedup of 16x (4x for SSE and 4x for the four cores). I have implemented the code with SSE (and AVX). Visual Studio 2012 has auto-vectorization, but I get better results by unrolling the loop myself.

I run my code on arrays (with 32-byte alignment) of four sizes: less than 32 KB, less than 256 KB, less than 8 MB, and greater than 8 MB, corresponding to the L1, L2, and L3 caches and main memory. For L1 I see about a 4x speedup using my unrolled SSE code (5-6x with AVX). That's as much as I expect. The efficiency drops at each cache level after that.

Then I use OpenMP to run on each core. I put "#pragma omp parallel for" before my main loop over the array. However, the best speedup I get is 5-6x with SSE + OpenMP. Does anyone have a clue why I'm not seeing a 16x speedup? Maybe it's due to some "upload" time of the array from system memory to the cache? I realize I should profile the code, but that's another adventure in itself that I have to learn.

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))  
void add_vector(float* result, const float* a, const float* b, const int N) {
    __m128 a4;
    __m128 b4;
    __m128 sum;
    int i = 0;
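    // process 8 floats per iteration (two SSE registers), N rounded down to a multiple of 8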
    for(; i < ROUND_DOWN(N, 8); i+=8) {
        a4 = _mm_load_ps(a + i);
        b4 = _mm_load_ps(b + i);
        sum = _mm_add_ps(a4, b4);
        _mm_store_ps(result + i, sum);
        a4 = _mm_load_ps(a + i + 4);
        b4 = _mm_load_ps(b + i + 4);
        sum = _mm_add_ps(a4, b4);
        _mm_store_ps(result + i + 4, sum);
    }
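    // scalar cleanup for the remaining elements (when N is not a multiple of 8)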
    for(; i < N; i++) {
        result[i] = a[i] + b[i];
    }
}
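
For completeness, the AVX version I mention is essentially the same kernel widened to 256-bit registers. This is only a sketch (the name add_vector_avx is just for illustration, and it may not match my actual code exactly); it assumes <immintrin.h>, the same ROUND_DOWN macro as above, and 32-byte aligned pointers so the aligned loads/stores are safe:

#include <immintrin.h>

void add_vector_avx(float* result, const float* a, const float* b, const int N) {
    int i = 0;
    // 16 floats per iteration = two 256-bit registers
    for(; i < ROUND_DOWN(N, 16); i += 16) {
        __m256 a8 = _mm256_load_ps(a + i);
        __m256 b8 = _mm256_load_ps(b + i);
        _mm256_store_ps(result + i, _mm256_add_ps(a8, b8));
        a8 = _mm256_load_ps(a + i + 8);
        b8 = _mm256_load_ps(b + i + 8);
        _mm256_store_ps(result + i + 8, _mm256_add_ps(a8, b8));
    }
    // scalar cleanup for the remainder
    for(; i < N; i++) {
        result[i] = a[i] + b[i];
    }
}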

My incorrect main loop, which has a race condition, looks something like this:

float *a = (float*)_aligned_malloc(N*sizeof(float), 32);
float *b = (float*)_aligned_malloc(N*sizeof(float), 32);
float *c = (float*)_aligned_malloc(N*sizeof(float), 32);
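// race: every thread calls add_vector with the same output array c, so the threads write c concurrently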
#pragma omp parallel for
for(int n=0; n<M; n++) {  // M is the number of passes over the array
    add_vector(c, a, b, N);
}

My corrected main loop based on Grizzly's suggestions:

float* results[4];
for(int i=0; i<4; i++) {
    results[i] = (float*)_aligned_malloc(N*sizeof(float), 32);
}
#pragma omp parallel for num_threads(4)
for(int t=0; t<4; t++) {
    for(int n=0; n<M/4; n++) { // M/4 passes per thread, M passes in total
        add_vector(results[t], a, b, N);
    }
}

1 Answer

6 votes

Disclaimer: Like you, I haven't profiled the code, so I can't answer with absolute certainty.

Your problem is most likely related to memory bandwidth or parallelization overhead.

Your loop is very computation-light: it does 1 add for 3 memory operations (two loads and a store), which makes you naturally limited by memory bandwidth (ALU throughput is much better than memory bandwidth on modern architectures). Therefore most of your time is spent transferring data.
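
To put rough numbers on that (illustrative only; the clock speed and memory figures are assumptions, so plug in your own):

    bytes moved per element:  2 loads + 1 store of a float = 3 * 4 bytes = 12 bytes
    one SSE core (assumed ~3 GHz, 4 adds/cycle): ~12 G adds/s  ->  ~144 GB/s of traffic
    four such cores:                                               ~576 GB/s of traffic
    typical dual-channel DDR3 main memory:                         ~20-25 GB/s available

So once you fall out of the caches, even a single vectorized core can demand far more bandwidth than main memory delivers; more cores or wider vectors can't fix that.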

If the data is small enough to fit in the caches, you could (theoretically) bind the OpenMP threads to specific cores and ensure that the correct part of the vector is in the L1/L2 cache of that specific core. But that won't really help unless you can also parallelize the initialization (it doesn't really matter when you transfer the data if you have to do it anyway), so you end up taking a hit from transferring data from one core's cache to another.
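
As a sketch of what I mean by parallelizing the initialization (the function name and fill values are made up; the point is that the init and the add use the same static schedule, so each thread keeps working on the chunk that is already in its core's cache):

#include <omp.h>

void init_and_add(float* result, float* a, float* b, int N) {
    #pragma omp parallel
    {
        // each thread initializes its own chunk (first touch happens here)
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++) {
            a[i] = 1.0f;
            b[i] = 2.0f;
        }
        // the same static schedule hands each thread the same chunk again,
        // so the data it needs is already in its core's cache
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++) {
            result[i] = a[i] + b[i];
        }
    }
}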

If the data doesn't fit in the processor caches, you are ultimately limited by the bandwidth to main memory. Thanks to prefetching, one core might already be able to almost max out that bandwidth with such a simple access pattern, which leaves you little room to grow.

The second point to remember is that creating an omp parallel construct and distributing the loop has a certain amount of overhead. For small datasets (the ones fitting in L1/L2/L3 probably qualify), this overhead can easily be as high as the computation time itself, giving you little to no speedup.
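
To illustrate that last point (a sketch only, reusing a, b, result, N and M from your code): you can open the parallel region once and only distribute the inner loop, so the fork/join cost is paid once instead of M times:

#pragma omp parallel            // thread team is created once...
{
    for (int n = 0; n < M; n++) {          // ...and reused for all M passes
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++) {
            result[i] = a[i] + b[i];
        }
        // implicit barrier at the end of the omp for keeps the passes separated
    }
}

Whether that wins anything depends on how large N is relative to the per-region overhead, which is something only measuring can tell you.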