7
votes

Good day everyone!

I'm working on a molecular dynamics simulation, and recently I started trying to parallelise it. At first sight everything looked simple enough: write a #pragma omp parallel for directive in front of the most time-consuming loops. But the functions in those loops operate on arrays, or, to be precise, on arrays that belong to an object of my class holding all the information about the particle system and the functions acting on it. When I added that #pragma directive before one of the most time-consuming loops, the computation time actually increased several times, even though my 2-core/4-thread processor was fully loaded.

To sort this out I wrote another, simpler test program. It performs two identical loops, the first in parallel and the second in serial, and measures the time each one takes. The results surprised me: when the first loop was run in parallel, its computation time dropped compared to serial mode (1500 ms versus 6000 ms), but the computation time of the second loop increased drastically (15 000 ms versus 6000 ms in serial).

I tried to use the private() and firstprivate() clauses, but the results were the same. Shouldn't every variable defined and initialized before the parallel region be shared automatically anyway? The computation time of the second loop goes back to normal if it operates on a different vector (vec2), but creating a new vector for every iteration is clearly not an option. I have also tried putting the actual update of vec1 into a #pragma omp critical section, but that didn't help either, and neither did adding a shared(vec1) clause.

I would appreciate it if you could point out my errors and show me the proper way.

Is it necessary to put that private(i) into the code?

Here is this test program:

#include "stdafx.h"
#include <omp.h>
#include <array>
#include <time.h>
#include <vector>
#include <iostream>
#include <Windows.h>
using namespace std;
#define N1  1000
#define N2  4000
#define dim 1000

int main(){
    vector<int>res1,res2;
    vector<double>vec1(dim),vec2(N1);
    clock_t t, tt;
    int k=0;
    for( k = 0; k<dim; k++){
        vec1[k]=1;
    }

    t = clock();

    #pragma omp parallel
    {
        double temp;
        int i,j,k;
        #pragma omp for private(i)
        for( i = 0; i<N1; i++){
            for(j = 0; j<N2; j++){
                for( k = 0; k<dim; k++){
                    temp+= j;
                }
            }
            vec1[i]+=temp;
            temp = 0;
        }
    }
    tt = clock();
    cout<<tt-t<<endl;
    for(int k = 0; k<dim; k++){
        vec1[k]=1;
    }
    t = clock();
    for(int g = 0; g<N1; g++){
        for(int h = 0; h<N2; h++){
            for(int y = 0; y<dim; y++){
                vec1[g]+=h; 
            }
        }
    }
    tt = clock();
    cout<<tt-t<<endl;
    getchar();
}

Thank you for your time!

P.S. I use Visual Studio 2012; my processor is an Intel Core i3-2370M. My assembly output is in two parts:

http://pastebin.com/suXn35xj

http://pastebin.com/EJAVabhF

You don't need to make i private: the loop variable of the loop you apply #pragma omp for to is made private automatically, and in any case it is defined inside the parallel section, so it will be private to each thread; the same is true of j, k, and temp. Note, though, that your temp is undefined when you start adding to it; you should initialize it to zero. As to the other question: can you confirm that what you mean is that the serial portion of the program runs more slowly if you run the parallel portion with multiple threads? If so, that's probably a memory-affinity thing. – Jonathan Dursi
Thank you for replying. Yes, that is the main problem: the serial portion of the program takes 15 000 ms to complete whenever the first part is run in parallel mode; normally it takes about 6000 ms. The only way I was able to get rid of this phenomenon was to perform the operations in the second part of the program on an entirely different vector (vec2 in my implementation). – Denis
What are the values of N1, N2 and dim? What compiler do you use? (I would assume Visual Studio or some other Windows IDE, given the getchar() at the end of your program...) What kind of CPU do you have? (We already know that it is dual-core, but what exactly?) – Hristo Iliev
@JonathanDursi, he seems to be running on a single-socket system with a hyperthreaded CPU, so I wouldn't blame memory affinity for the slowdown. – Hristo Iliev
Could you please go to Configuration Properties -> C/C++ -> Output Files and set Assembler Output to Assembly With Source Code. Then rebuild the project and look into the Release directory for files with the .asm extension. Paste their content to pastebin and add the URLs to your question. Do that with OpenMP enabled and with OpenMP disabled (please mark clearly which file is from which case in the pastes). Maybe we should continue this discussion in chat. – Hristo Iliev

1 Answer

9
votes

Congratulations! You have exposed yet another bad OpenMP implementation, courtesy of Microsoft. My initial theory was that the problem comes from the partitioned L3 cache in Sandy Bridge and later Intel CPUs. But the result from running the second loop only on the first half of the vector did not confirm that theory. Then it has to be something in the code generator that is triggered when OpenMP is enabled. The assembly output confirms this.

Basically the compiler does not optimise the serial loop when compiling with OpenMP enabled. That's where the slowdown comes from. Part of the problem is also of your own making, since the second loop is not identical to the first one. In the first loop you accumulate intermediate values into a temporary variable, which the compiler optimises into a register variable, while in the second loop you invoke operator[] on every iteration. When you compile without OpenMP enabled, the code optimiser transforms the second loop into something quite similar to the first loop, hence you get almost the same run time for both loops.
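
For illustration, here is the serial loop rewritten by hand to accumulate into a local temporary, the way the first loop does. This is only a sketch of the transformation the optimiser normally performs on its own (the names and the N1/N2/dim macros are taken from the test program above), not something that was benchmarked here:

for(int g = 0; g<N1; g++){
    double temp = 0;                // local accumulator the compiler can keep in a register
    for(int h = 0; h<N2; h++){
        for(int y = 0; y<dim; y++){
            temp += h;              // no access to vec1 inside the hot loops
        }
    }
    vec1[g] += temp;                // single write-back per outer iteration
}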

When you enable OpenMP, the code optimiser does not optimise the second loop and it runs much slower. The fact that your code executes a parallel block before it has nothing to do with the slowdown. My guess is that the code optimiser is unable to grasp that the second loop lies outside the scope of the OpenMP parallel region, and hence that vec1 no longer has to be treated as a shared variable there and the loop can be optimised. Obviously this is a "feature" introduced in Visual Studio 2012, since the code generator in Visual Studio 2010 is able to optimise the second loop even with OpenMP enabled.

One possible solution would be to migrate to Visual Studio 2010. Another (hypothetical, since I don't have VS2012) solution would be to extract the second loop into a function and to pass the vector by reference to it. Hopefully the compiler would be smart enough to optimise the code in the separate function.
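
A rough sketch of that second workaround, reusing the N1, N2 and dim macros and the vec1 vector from the test program above (untested with VS2012, so treat it as a hypothesis rather than a verified fix):

void serial_part(std::vector<double> &v){
    // The optimiser now sees the vector only as a by-reference parameter,
    // away from any OpenMP construct, which should let it apply the same
    // register-accumulation transformation as in the non-OpenMP build.
    for(int g = 0; g<N1; g++){
        for(int h = 0; h<N2; h++){
            for(int y = 0; y<dim; y++){
                v[g] += h;
            }
        }
    }
}

// ...and in main(), in place of the inline serial loop:
// serial_part(vec1);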

This is a very bad trend. Microsoft have practically given up on supporting OpenMP in Visual C++. Their implementation still conforms (almost) to OpenMP 2.0 only (hence no explicit tasks and other OpenMP 3.0+ goodies), and bugs like this one do not make things any better. I would recommend that you switch to another OpenMP-enabled compiler (Intel C/C++ Compiler, GCC, anything non-Microsoft) or to some other compiler-independent threading paradigm, for example Intel Threading Building Blocks. Microsoft is clearly pushing their parallel library for .NET, and that is where all the development effort goes.


Big Fat Warning

Do not use clock() to measure elapsed wall-clock time! It only works as expected on Windows. On most Unix systems (including Linux), clock() returns the total CPU time consumed by all threads in the process since it was created. This means that clock() may return values that are either several times larger than the elapsed wall-clock time (if the program runs with many busy threads) or several times shorter than the wall-clock time (if the program sleeps or waits on I/O between the measurements). Instead, in OpenMP programs, use the portable timer function omp_get_wtime().
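
For example (a minimal sketch using the variables from the test program above; omp_get_wtime() returns seconds as a double, so the result is converted to milliseconds to stay comparable with the clock() output):

double t0 = omp_get_wtime();        // wall-clock time in seconds
// ... the loop being timed goes here ...
double t1 = omp_get_wtime();
cout<<(t1-t0)*1000.0<<" ms"<<endl;  // convert seconds to milliseconds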