I'm learning OpenMP using the example of computing the value of pi via numerical quadrature. In serial, I run the following C code:
double serial() {
    double step;
    double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (int i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step; // midpoint rule
        sum += 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    return pi;
}
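For reference, this is the midpoint rule applied to the standard arctangent integral (writing N for num_steps, so step = 1/N):

\pi = \int_0^1 \frac{4}{1 + x^2}\, dx \;\approx\; \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + x_i^2}, \qquad x_i = \frac{i + 0.5}{N}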
I'm comparing this to an OpenMP implementation using a parallel for with a reduction:
double SPMD_for_reduction() {
    double step;
    double pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    return pi;
}
For num_steps = 1,000,000,000 (and 6 threads in the OpenMP case), I compile and time both functions:
double start_time = omp_get_wtime();
serial();
double end_time = omp_get_wtime();
start_time = omp_get_wtime();
SPMD_for_reduction();
end_time = omp_get_wtime();
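For completeness, the surrounding harness looks roughly like this (a sketch; in my actual file num_steps is a file-scope global declared before both functions, and only the elapsed times are printed, not the returned values):

#include <stdio.h>
#include <omp.h>

static long num_steps = 1000000000;   // 1e9 steps, matching the runs described below

int main(void) {
    double start_time = omp_get_wtime();
    serial();                          // return value deliberately unused, as in my real code
    double end_time = omp_get_wtime();
    printf("serial: %f s\n", end_time - start_time);

    start_time = omp_get_wtime();
    SPMD_for_reduction();              // return value deliberately unused
    end_time = omp_get_wtime();
    printf("omp:    %f s\n", end_time - start_time);
    return 0;
}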
Compiling with cc and no optimization flags, the runtimes are around 4 s (serial) and 0.66 s (OpenMP). With the -O3 flag, the serial runtime drops to roughly 0.000001 s, while the OpenMP runtime is mostly unchanged. What's going on here? Is it vector instructions being used, or is it poor code or a flawed timing method? If it's vectorization, why isn't the OpenMP function benefiting?
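In case it's the timing method, one check I can think of would be to actually consume the returned values so the computation has an observable effect (a sketch; the pi_serial/pi_omp variables and the printf calls are my additions):

double pi_serial = serial();
printf("pi (serial) = %.15f\n", pi_serial); // using the result means the compiler cannot discard the call
double pi_omp = SPMD_for_reduction();
printf("pi (omp)    = %.15f\n", pi_omp);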
It may be of interest that the machine I'm using has a modern 6-core Xeon processor.
Thanks!