3
votes

I have the following loop that I'd like to accelerate using #pragma omp simd:

#define N 1024
double* data = new double[N];
// Generate data, not important how.

double mean = 0.0
for (size_t i = 0; i < N; i++) {
    mean += (data[i] - mean) / (i+1);
}

As I expected, just putting #pragma omp simd directly before the loop has no impact (I'm examining running times). I can tackle the multi-threaded case easily enough using #pragma omp parallel for reduction(...) with a custom reducer as shown below, but how do I put OpenMP SIMD to use here?

I'm using the following class for implementing the + and += operators for adding a double to a running mean as well as combining two running means:

class RunningMean {
    private:
        double mean;
        size_t count;

    public:
        RunningMean(): mean(0), count(0) {}
        RunningMean(double m, size_t c): mean(m), count(c) {}

        RunningMean operator+(RunningMean& rhs) {
            size_t c = this->count + rhs.count;
            double m = (this->mean*this->count + rhs.mean*rhs.count) / c;
            return RunningMean(m, c);
        }

        RunningMean operator+(double rhs) {
            size_t c = this->count + 1;
            double m = this->mean + (rhs - this->mean) / c;
            return RunningMean(m, c);
        }

        RunningMean& operator+=(const RunningMean& rhs) {
            this->mean = this->mean*this->count + rhs.mean*rhs.count;
            this->count += rhs.count;
            this->mean /= this->count;
            return *this;
        }

        RunningMean& operator+=(double rhs) {
            this->count++;
            this->mean += (rhs - this->mean) / this->count;
            return *this;
        }

        double getMean() { return mean; }
        size_t getCount() { return count; }
};

The maths for this comes from http://prod.sandia.gov/techlib/access-control.cgi/2008/086212.pdf. For multi-threaded, non-SIMD parallel reduction I do the following:

#pragma omp declare reduction (runningmean : RunningMean : omp_out += omp_in)
RunningMean mean;
#pragma omp parallel for reduction(runningmean:mean)
for (size_t i = 0; i < N; i++)
    mean += data[i];

This gives me a 3.2X speedup on my Core i7 2600k using 8 threads.

If I was to implement the SIMD myself without OpenMP, I would just maintain 4 means in a vector, 4 counts in another vector (assuming the use of AVX instructions) and keep on adding 4-element double precision vectors using a vectorised version of operator+(double rhs). Once that is done, I would add the resulting 4 pairs of means and counts using the maths from operator+=. How can I instruct OpenMP to do this?

2
Your custom reducer has recursion, that I guess the pragma cannot solve... What about: mean += data[i] / N?GHL
@GHL This method of calculating the mean gives one a running total which can then be used in other methods, e.g. for variance that give a running variance, i.e. variance after i iterations. Have a look at this Wikipedia article. For variance, this works out to be more numerically stable than the normal textbook definition, which is why I'm interested in it.chippies
Understood, but how do you want the SIMD to speed up your calculation, as to compute each running sum (ie each iteration of the loop) you need the result of the previous one? I would say that for SIMD to work you need your calculations to be independent?GHL
Can you please explain how can "tackle the multi-threaded case easily enough using #pragma omp parallel for reduction(...) with a custom reducer"? I would like to see that.Z boson
@chippies: can you present the code you actually wrote and it failed?Tigran

2 Answers

2
votes

The problem is that

mean += (data[i] - mean) / (i+1);

is not easily amenable to SIMD. However, by studying the math carefully it's possible to vectorized this without too much effort.

The key forumla is

mean(n+m) = (n*mean(n) + m*mean(m))/(n+m)

which shows how to add the means of n numbers and the mean of m numbers. This can be seen in your operator definition RunningMean operator+(RunningMean& rhs). This explains why your parallel code works. I think this is more clear if we deconvolute your C++ code:

double mean = 0.0;
int count = 0;
#pragma omp parallel
{
    double mean_private = 0.0;
    int count_private = 0;
    #pragma omp for nowait
    for(size_t i=0; i<N; i++) {
        count_private ++;
        mean_private += (data[i] - mean_private)/count_private;
    }
    #pragma omp critical
    {
        mean = (count_private*mean_private + count*mean);
        count += count_private;
        mean /= count;
    }
}

But we can use the same idea with SIMD (and combine them together). But let's first do the SIMD only part. Using AVX we can handle four parallel means at once. Each parallel mean will handle the following data elements:

mean 1 data elements: 0,  4,  8, 12,...
mean 2 data elements: 1,  5,  9, 13,...
mean 3 data elements: 2,  6, 10, 14,...
mean 4 data elements: 3,  7, 11, 15,...

One we have looped over all the elements then we add the four parallel sums together and divide by four (since each sum runs over N/4 elements).

Here is the code to to this

double mean = 0.0;
__m256d mean4 = _mm256_set1_pd(0.0);
__m256d count4 = _mm256_set1_pd(0.0);
for(size_t i=0; i<N/4; i++) {
    count4 = _mm256_add_pd(count4,_mm256_set1_pd(1.0));
    __m256d t1 = _mm256_loadu_pd(&data[4*i]);
    __m256d t2 = _mm256_div_pd(_mm256_sub_pd(t1, mean4), count4);
    mean4 = _mm256_add_pd(t2, mean4);   
}
__m256d t1 = _mm256_hadd_pd(mean4,mean4);
__m128d t2 = _mm256_extractf128_pd(t1,1);
__m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
mean = _mm_cvtsd_f64(t3)/4;
int count = 0;
double mean2 = 0;
for(size_t i=4*(N/4); i<N; i++) {
    count++;
    mean2 += (data[i] - mean2)/count;
}
mean = (4*(N/4)*mean + count*mean2)/N;

Finally, we can combine this with OpenMP to get the full benefit of SIMD and MIMD like this

double mean = 0.0;
int count = 0;
#pragma omp parallel 
{
    double mean_private = 0.0;
    int count_private = 0;
    __m256d mean4 = _mm256_set1_pd(0.0);
    __m256d count4 = _mm256_set1_pd(0.0);
    #pragma omp for nowait
    for(size_t i=0; i<N/4; i++) {
        count_private++;
        count4 = _mm256_add_pd(count4,_mm256_set1_pd(1.0));
        __m256d t1 = _mm256_loadu_pd(&data[4*i]);
        __m256d t2 = _mm256_div_pd(_mm256_sub_pd(t1, mean4), count4);
        mean4 = _mm256_add_pd(t2, mean4);
    }
    __m256d t1 = _mm256_hadd_pd(mean4,mean4);
    __m128d t2 = _mm256_extractf128_pd(t1,1);
    __m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
    mean_private = _mm_cvtsd_f64(t3)/4;

    #pragma omp critical
    {
        mean = (count_private*mean_private + count*mean);
        count += count_private;
        mean /= count;
    }   
}
int count2 = 0;
double mean2 = 0;
for(size_t i=4*(N/4); i<N; i++) {
    count2++;
    mean2 += (data[i] - mean2)/count2;
}
mean = (4*(N/4)*mean + count2*mean2)/N;

And here is a working example (compile with -O3 -mavx -fopenmp)

#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

double mean_simd(double *data, const int N) {
    double mean = 0.0;
    __m256d mean4 = _mm256_set1_pd(0.0);
    __m256d count4 = _mm256_set1_pd(0.0);
    for(size_t i=0; i<N/4; i++) {
        count4 = _mm256_add_pd(count4,_mm256_set1_pd(1.0));
        __m256d t1 = _mm256_loadu_pd(&data[4*i]);
        __m256d t2 = _mm256_div_pd(_mm256_sub_pd(t1, mean4), count4);
        mean4 = _mm256_add_pd(t2, mean4);           
    }
    __m256d t1 = _mm256_hadd_pd(mean4,mean4);
    __m128d t2 = _mm256_extractf128_pd(t1,1);
    __m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
    mean = _mm_cvtsd_f64(t3)/4;
    int count = 0;
    double mean2 = 0;
    for(size_t i=4*(N/4); i<N; i++) {
        count++;
        mean2 += (data[i] - mean2)/count;
    }
    mean = (4*(N/4)*mean + count*mean2)/N;
    return mean;
}

double mean_simd_omp(double *data, const int N) {
    double mean = 0.0;
    int count = 0;
    #pragma omp parallel 
    {
        double mean_private = 0.0;
        int count_private = 0;
        __m256d mean4 = _mm256_set1_pd(0.0);
        __m256d count4 = _mm256_set1_pd(0.0);
        #pragma omp for nowait
        for(size_t i=0; i<N/4; i++) {
            count_private++;
            count4 = _mm256_add_pd(count4,_mm256_set1_pd(1.0));
            __m256d t1 = _mm256_loadu_pd(&data[4*i]);
            __m256d t2 = _mm256_div_pd(_mm256_sub_pd(t1, mean4), count4);
            mean4 = _mm256_add_pd(t2, mean4);
        }
        __m256d t1 = _mm256_hadd_pd(mean4,mean4);
        __m128d t2 = _mm256_extractf128_pd(t1,1);
        __m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
        mean_private = _mm_cvtsd_f64(t3)/4;

        #pragma omp critical
        {
            mean = (count_private*mean_private + count*mean);
            count += count_private;
            mean /= count;
        }   
    }
    int count2 = 0;
    double mean2 = 0;
    for(size_t i=4*(N/4); i<N; i++) {
        count2++;
        mean2 += (data[i] - mean2)/count2;
    }
    mean = (4*(N/4)*mean + count2*mean2)/N;
    return mean;
}


int main() {
    const int N = 1001;
    double data[N];

    for(int i=0; i<N; i++) data[i] = 1.0*rand()/RAND_MAX;
    float sum = 0; for(int i=0; i<N; i++) sum+= data[i]; sum/=N;
    printf("mean %f\n", sum);
    printf("mean_simd %f\n", mean_simd(data, N);
    printf("mean_simd_omp %f\n", mean_simd_omp(data, N));
}
0
votes

The KISS answer: Just calculate the mean outside the loop. Parallelize the following code:

double sum = 0.0;
for(size_t i = 0; i < N; i++) sum += data[i];
double mean = sum/N;

The sum is easily parallelizeable, but you won't see any effect of SIMD optimization: it is purely memory bound, the CPU will only be waiting for data from memory. If N is as small as 1024, there is even little point in parallelization, the synchronization overhead will eat up all the gains.