
I'm learning OpenMP using the example of computing the value of pi via quadrature. In serial, I run the following C code:

    static long num_steps = 1000000000; // global; 1,000,000,000 for the timings below

    double serial(void) {
        double step;
        double x, pi, sum = 0.0;

        step = 1.0 / (double) num_steps;

        for (int i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step; // midpoint of the i-th interval
            sum += 4.0 / (1.0 + x*x);
        }
        pi = step * sum;

        return pi;
    }

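For reference, both functions apply the midpoint rule to a standard integral for pi, with N = num_steps and step size Δx = step:

$$
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \Delta x \sum_{i=0}^{N-1} \frac{4}{1+x_i^2},
\qquad x_i = \left(i + \tfrac{1}{2}\right)\Delta x,\quad \Delta x = \frac{1}{N}.
$$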
I'm comparing this to an OpenMP implementation using a parallel for with a reduction:

    double SPMD_for_reduction(void) {
        double step;
        double pi, sum = 0.0;

        step = 1.0 / (double) num_steps;

        // each thread accumulates a private sum; the partial sums are added at the end
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x*x);
        }
        pi = step * sum;

        return pi;
    }

For num_steps = 1,000,000,000 (and 6 threads for the OpenMP version), I compile and time both functions like this:

    double start_time = omp_get_wtime();
    serial();
    double end_time = omp_get_wtime();

    start_time = omp_get_wtime();
    SPMD_for_reduction();
    end_time = omp_get_wtime();

Compiling with cc and no optimization flags, the runtimes are around 4 s (serial) and 0.66 s (OpenMP). With -O3, the serial runtime drops to about 0.000001 s, while the OpenMP runtime is mostly unchanged. What's going on here? Is it vector instructions being used, or is it poor code or a poor timing method? If it's vectorization, why isn't the OpenMP function benefiting?

In case it matters, the machine I'm using has a modern 6-core Xeon processor.



Answer


The compiler is outsmarting you. In the serial version it can see that the result of the computation is never used, and since serial() has no other side effects, it removes the computation entirely (dead-code elimination):

    double start_time = omp_get_wtime();
    serial(); // <-- result never used
    double end_time = omp_get_wtime();

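In the OpenMP case the compiler cannot prove that the function body is free of side effects: the parallel region is outlined into a separate function and handed to the OpenMP runtime library, whose code the compiler cannot see. To stay on the safe side, it keeps the function call.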

You can of course write something like `double serial_pi = serial();` and then, outside the timed region, do something with `serial_pi` (printing it is enough). That way the compiler has to keep the computation while still applying the optimizations you are actually interested in.
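A minimal sketch of that fix, assuming num_steps is a global as in the question. It stores both results (serial_pi, plus omp_pi as a hypothetical name for the OpenMP result) and prints them only after the timers have stopped. The helpers wall_time() and time_pi_versions() are hypothetical names introduced for this sketch; wall_time() uses omp_get_wtime() when compiled with -fopenmp and falls back to C11 timespec_get() otherwise, so the file also builds without OpenMP.

```c
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of intervals; 1e9 as in the question. */
static long num_steps = 1000000000L;

/* Hypothetical helper: wall-clock seconds. Uses omp_get_wtime() when built
   with -fopenmp, otherwise C11 timespec_get() so it also builds without OpenMP. */
static double wall_time(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double) ts.tv_sec + 1e-9 * (double) ts.tv_nsec;
#endif
}

/* Midpoint-rule quadrature of 4/(1+x^2) on [0,1], serial. */
double serial(void) {
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}

/* Same computation split across threads; reduction(+:sum) gives each
   thread a private partial sum and adds them together at the end. */
double SPMD_for_reduction(void) {
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}

/* Hypothetical driver: times both versions. The results are stored and
   printed after the timed regions, so the compiler has to keep the loops. */
void time_pi_versions(void) {
    double start = wall_time();
    double serial_pi = serial();
    double serial_time = wall_time() - start;

    start = wall_time();
    double omp_pi = SPMD_for_reduction();
    double omp_time = wall_time() - start;

    printf("serial: pi = %.12f  (%.3f s)\n", serial_pi, serial_time);
    printf("omp:    pi = %.12f  (%.3f s)\n", omp_pi, omp_time);
}
```

Call time_pi_versions() from main and build with, for example, cc -O3 -fopenmp, setting OMP_NUM_THREADS=6 to match the question. Because the printed values depend on the loops, -O3 can no longer delete them, so the serial timing should now reflect the loop itself.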