I have the following loop that I'd like to accelerate using #pragma omp simd
:
#define N 1024
double* data = new double[N];
// Generate data, not important how.
double mean = 0.0
for (size_t i = 0; i < N; i++) {
mean += (data[i] - mean) / (i+1);
}
As I expected, just putting #pragma omp simd
directly before the loop has no impact (I'm examining running times). I can tackle the multi-threaded case easily enough using #pragma omp parallel for reduction(...)
with a custom reducer as shown below, but how do I put OpenMP SIMD to use here?
I'm using the following class for implementing the + and += operators for adding a double
to a running mean as well as combining two running means:
class RunningMean {
private:
double mean;
size_t count;
public:
RunningMean(): mean(0), count(0) {}
RunningMean(double m, size_t c): mean(m), count(c) {}
RunningMean operator+(RunningMean& rhs) {
size_t c = this->count + rhs.count;
double m = (this->mean*this->count + rhs.mean*rhs.count) / c;
return RunningMean(m, c);
}
RunningMean operator+(double rhs) {
size_t c = this->count + 1;
double m = this->mean + (rhs - this->mean) / c;
return RunningMean(m, c);
}
RunningMean& operator+=(const RunningMean& rhs) {
this->mean = this->mean*this->count + rhs.mean*rhs.count;
this->count += rhs.count;
this->mean /= this->count;
return *this;
}
RunningMean& operator+=(double rhs) {
this->count++;
this->mean += (rhs - this->mean) / this->count;
return *this;
}
double getMean() { return mean; }
size_t getCount() { return count; }
};
The maths for this comes from http://prod.sandia.gov/techlib/access-control.cgi/2008/086212.pdf. For multi-threaded, non-SIMD parallel reduction I do the following:
#pragma omp declare reduction (runningmean : RunningMean : omp_out += omp_in)
RunningMean mean;
#pragma omp parallel for reduction(runningmean:mean)
for (size_t i = 0; i < N; i++)
mean += data[i];
This gives me a 3.2X speedup on my Core i7 2600k using 8 threads.
If I was to implement the SIMD myself without OpenMP, I would just maintain 4 means in a vector, 4 counts in another vector (assuming the use of AVX instructions) and keep on adding 4-element double precision vectors using a vectorised version of operator+(double rhs)
. Once that is done, I would add the resulting 4 pairs of means and counts using the maths from operator+=
. How can I instruct OpenMP to do this?
mean += data[i] / N
? – GHLi
iterations. Have a look at this Wikipedia article. For variance, this works out to be more numerically stable than the normal textbook definition, which is why I'm interested in it. – chippies