6
votes

I am trying to convert a C project of mine from sequential to parallel programming. Although most of the code has now been redesigned from scratch for this purpose, the generation of random numbers is still at its core, so poor performance of the random number generator (RNG) severely degrades the overall performance of the program.

I wrote a few lines of code (see below) to illustrate the problem without much verbosity.

The problem is the following: every time the number of threads nt increases, the performance gets significantly worse. On this workstation (Linux kernel 2.6.33.4; gcc 4.4.4; Intel quad-core CPU) the parallel for loop takes roughly 10x longer to finish with nt=4 than with nt=1, regardless of the number of iterations n.

This situation seems to be described here, but the focus there is mainly on Fortran, a language I know very little about, so I would very much appreciate some help.

I tried to follow their idea of creating a different RNG (with a different seed) to be accessed by each thread, but the performance is still very bad. Actually, the different seeding point for each thread bothers me as well, because I cannot see how one can guarantee the quality of the generated numbers in the end (lack of correlations between streams, etc.).

I have already thought of dropping GSL altogether and implementing a random generator algorithm (such as Mersenne Twister) myself, but I suspect I would just run into the same issue later on.

Thank you very much in advance for your answers and advice. Please ask about anything important I may have forgotten to mention.

EDIT: Implemented the corrections suggested by lucas1024 (pragma for-loop declaration) and Jonathan Dursi (seeding; making "a" a private variable). Performance is still very sluggish in multi-threaded mode.

EDIT 2: Implemented solution suggested by Jonathan Dursi (see comments).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <gsl/gsl_rng.h>
    #include <omp.h>

    double d_t (struct timespec t1, struct timespec t2){

        return (t2.tv_sec-t1.tv_sec)+(double)(t2.tv_nsec-t1.tv_nsec)/1000000000.0;
    }

    int main (int argc, char *argv[]){

        double a, b;

        int i,j,k;

        int n=atoi(argv[1]), seed=atoi(argv[2]), nt=atoi(argv[3]);

        printf("\nn\t= %d", n);
        printf("\nseed\t= %d", seed);
        printf("\nnt\t= %d", nt);

        struct timespec t1, t2, t3, t4;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

        //initialize gsl random number generator
        const gsl_rng_type *rng_t;
        gsl_rng **rng;
        gsl_rng_env_setup();
        rng_t = gsl_rng_default;

        rng = (gsl_rng **) malloc(nt * sizeof(gsl_rng *));

        #pragma omp parallel for num_threads(nt)
        for(i=0;i<nt;i++){
            rng[i] = gsl_rng_alloc (rng_t);
            gsl_rng_set(rng[i],seed*i);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2);

        for (i=0;i<n;i++){
            a = gsl_rng_uniform(rng[0]);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t3);

        omp_set_num_threads(nt);
        #pragma omp parallel private(j,a)
        {
            j = omp_get_thread_num();
            #pragma omp for
            for(i=0;i<n;i++){
                a = gsl_rng_uniform(rng[j]);
            }
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t4);

        printf("\n\ninitializing:\t\tt1 = %f seconds", d_t(t1,t2));
        printf("\nsequential for loop:\tt2 = %f seconds", d_t(t2,t3));
        printf("\nparallel for loop:\tt3 = %f seconds (%f * t2)", d_t(t3,t4), (double)d_t(t3,t4)/(double)d_t(t2,t3));
        printf("\nnumber of threads:\tnt = %d\n", nt);

        //free random number generator
        for (i=0;i<nt;i++)
            gsl_rng_free(rng[i]);
        free(rng);

        return 0;

    }
I don't know how your random number generator works, but if it relies on the system entropy pool, that may be running dry. If so, check out entropykey.co.uk – blueshift

I am sorry, blueshift, but I do not think the problem I described is even remotely related to what you pointed out. – dd_rlwll

No problem! Checking /proc/sys/kernel/random/entropy_avail may confirm one way or the other. – blueshift

One issue is that in the parallel loop, all threads are writing to a single shared variable a; that should be made private. Also, you're setting all your seeds to seed (not a performance issue, admittedly); it should be seed*i, not seed*omp_get_thread_num() = seed*1 outside of a parallel region. You can also use more compute-intensive RNGs like gsl_rng_mt19937. But there's still a factor of 10-20 performance drop here which doesn't go down with N, and I've got to admit I don't see it. I thought it might be false sharing with the rng array, but spacing things out more, it's still there. – Jonathan Dursi

@JonathanDursi, it is true indeed that the seeding was wrong. Actually, and regarding one of my original questions, how can I guarantee the quality of the generated numbers if I seed the different RNGs this way (i*seed)? Or any other way (e.g. i*seed*3141592)? I would say that an overlap, which seems to be (very) possible in this situation, would hugely affect correlations. I agree that it might be useful to use a more compute-intensive RNG, and I have just tested it with gsl_rng_mt19937. Unfortunately, the drop in performance is still huge. – dd_rlwll

1 Answer

4
votes

The problem is in the second #pragma omp line. The first #pragma omp parallel spawns the 4 threads. Inside that region you are supposed to simply say #pragma omp for — not #pragma omp parallel for.

With the original code, depending on your OMP nesting settings, you were creating 4 x 4 threads doing the same work and accessing the same data.