0 votes

I have the following code that uses OpenMP to parallelize a Monte Carlo method. My question is why the serial version of the code (monte_carlo_serial) runs much faster than the parallel version (monte_carlo_parallel). I am running the code on a machine with 32 cores and get the following result printed to the console:

-bash-4.1$ gcc -fopenmp hello.c ;
-bash-4.1$ ./a.out
Pi (Serial): 3.140856
Time taken 0 seconds 50 milliseconds
Pi (Parallel): 3.132103
Time taken 127 seconds 990 milliseconds

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>

int niter = 1000000;            //number of iterations per FOR loop

int monte_carlo_parallel() {
  double x,y;                     //x,y value for the random coordinate
  int i;                          //loop counter
  int count=0;                //count holds the number of points that fall inside the unit circle
  double z;                       //Used to check if x^2+y^2<=1
  double pi;                      //holds approx value of pi
  int numthreads = 32;

#pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
  {
    srand48((int)time(NULL) ^ omp_get_thread_num());    //Seed drand48() for this thread
    for (i=0; i<niter; ++i)                 //main loop
      {
        x = (double)drand48();              //gets a random x coordinate
        y = (double)drand48();              //gets a random y coordinate
        z = ((x*x)+(y*y));              //Checks to see if number is inside unit circle
        if (z<=1)
          {
            ++count;                //if it is, consider it a valid random point
          }
      }
  }

  pi = ((double)count/(double)(niter*numthreads))*4.0;
  printf("Pi (Parallel): %f\n", pi);
  return 0;
}

int monte_carlo_serial(){
  double x,y;                     //x,y value for the random coordinate
  int i;                          //loop counter
  int count=0;                //count holds the number of points that fall inside the unit circle
  double z;                       //Used to check if x^2+y^2<=1
  double pi;                      //holds approx value of pi

  srand48((int)time(NULL) ^ omp_get_thread_num());  //Seed drand48()

  for (i=0; i<niter; ++i)                   //main loop
    {
      x = (double)drand48();                //gets a random x coordinate
      y = (double)drand48();                //gets a random y coordinate
      z = ((x*x)+(y*y));                //Checks to see if number is inside unit circle
      if (z<=1)
        {
          ++count;              //if it is, consider it a valid random point
        }
    }

  pi = ((double)count/(double)(niter))*4.0;
  printf("Pi (Serial): %f\n", pi);

  return 0;
}


int main(void){
  clock_t start = clock(), diff;

  monte_carlo_serial();

  diff = clock() - start;
  int msec = diff * 1000 / CLOCKS_PER_SEC;
  printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);



  start = clock();

  monte_carlo_parallel();

  diff = clock() - start;
  msec = diff * 1000 / CLOCKS_PER_SEC;
  printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);

}
As the related references tell you, your tally of the total CPU time taken by all threads is expected to increase with the number of threads. There is no point in making x, y firstprivate rather than simply declaring them with local scope, particularly as most of the time is spent in the serialized drand48(). – tim18
You should realize that: 1/ drand48() isn't thread-safe, as it uses a global state (look at drand48_r() for a possible replacement if you want to stick with this RNG); and 2/ clock() gives you CPU time, not elapsed time... You should use omp_get_wtime() for all your timing tasks here. Finally, your issue has nothing to do with false sharing of count. – Gilles
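
Putting those two comments together, here is a minimal sketch (not the poster's exact code) of what the fixes could look like, assuming glibc's drand48_r()/srand48_r() are available (they are GNU extensions), and splitting the niter iterations across the threads with an omp for instead of running the whole loop once per thread:

#define _GNU_SOURCE             //drand48_r/srand48_r are GNU extensions
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(void) {
  const int niter = 1000000;
  long count = 0;
  double t0 = omp_get_wtime();            //wall-clock time, not CPU time

#pragma omp parallel reduction(+:count)
  {
    struct drand48_data rng;              //each thread owns its RNG state
    srand48_r(time(NULL) ^ omp_get_thread_num(), &rng);

#pragma omp for
    for (int i = 0; i < niter; ++i) {
      double x, y;
      drand48_r(&rng, &x);                //thread-safe replacement for drand48()
      drand48_r(&rng, &y);
      if (x*x + y*y <= 1.0)
        ++count;                          //valid random point inside the circle
    }
  }

  double t1 = omp_get_wtime();
  printf("Pi: %f (%.3f s elapsed)\n", ((double)count/(double)niter)*4.0, t1 - t0);
  return 0;
}

With omp for the total number of samples is niter, so the estimate is divided by niter only; omp_get_wtime() reports elapsed wall-clock time, which is what you want to compare against the serial run.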

1 Answer

-3 votes

The variable count is shared across all of your spawned threads. Each of them has to lock count to increment it. In addition, if the threads are running on separate CPUs (and there's no possible win if they're not), you have the cost of sending the value of count from one core to another and back again.

This is a textbook example of false sharing. In your serial version, count will be in a register and cost one cycle to increment. In the parallel version it will usually not be in cache: you have to tell the other cores to invalidate that cache line, fetch it (L3 is going to take 66 cycles at best), increment it, and store it back. Every time count migrates from one CPU core to another there is a minimum cost of roughly 125 cycles, which is a lot worse than 1. The threads will never be able to run in parallel because they all depend on count.

Try modifying your code so that each thread has its own count, then sum the counts from all the threads at the end, and you might see a speedup.
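
For what it's worth, the reduction(+:count) clause in the question already gives each thread a private copy of count and combines the copies at the end. Written out by hand, the pattern this answer describes looks roughly like the sketch below; rand_r() is used only as a stand-in thread-safe RNG to keep the example self-contained:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
  const int niter = 1000000;
  long count = 0;                         //shared total, touched once per thread

#pragma omp parallel
  {
    long local = 0;                       //private per-thread tally
    unsigned seed = 42u + omp_get_thread_num();

#pragma omp for
    for (int i = 0; i < niter; ++i) {
      double x = rand_r(&seed) / (double)RAND_MAX;
      double y = rand_r(&seed) / (double)RAND_MAX;
      if (x*x + y*y <= 1.0)
        ++local;                          //no shared data touched inside the loop
    }

#pragma omp atomic
    count += local;                       //one combine per thread at the end
  }

  printf("Pi: %f\n", ((double)count/(double)niter)*4.0);
  return 0;
}

Whether this changes the poster's numbers is a separate question, since the comments above point at clock() measuring summed CPU time and at the shared drand48() state rather than at count.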