I am having a problem with OpenMP threading: the threads in a parallel section take very different times to execute. I am running on a Linux cluster. The code is pure OpenMP, i.e., no MPI is mixed in, and it is written in C++. I am using gcc 4.5.3, and therefore OpenMP 3.0, at optimization level 2, i.e., -O2. I give the code first, then the data generated from it. Since I want to loop over the keys of a map, I first copy the keys into a vector, vec_keys, and then I run the parallel for loop over the elements of vec_keys. There is a for loop that is parallelized with OMP. There are 800,000 "nodes" or entries that must be processed, and there are 3 "commands" within the parallel for loop:
1. Get the map key. See the line: node = vec_keys.at(itime);
2. Using the key from step 1, get a pointer to a C++ object that is going to have a method invoked on it. See the line: p_nim = p_dmb->getModel(node);
3. Invoke the method on the object retrieved from the map in step 2. See the statement:
   isStateUpdate = p_nim->computeNextState(day, iteration, fsmId, p_dmb, tid, node, p_np, p_ep, p_cp, p_fg, newStateNp, outStream);
Note that in step 2, map entries are retrieved but not written. In step 3, the map's contents are changed, but only through indirection: the map keys do not change, and the values (in the map entries) are pointers to primitive arrays that were instantiated on the heap. Thus, without changing the value pointers, I can change the contents of the primitive arrays. The point is that I am working with maps, each key is visited exactly once in the OMP for loop, and there are no race conditions or inconsistent memory accesses. I have run with 1, 2, 4, and 8 threads many times, and the output is always correct. The operations in steps 1 and 2 above are the same for every map key; only step 3 can differ. The code is:
#pragma omp parallel num_threads(numSpecOmpThreads) \
  private(itime, node, index, modelId, isStateUpdate, tid, \
          b1time, e1time) \
  shared(numkeys, vec_keys, p_graph, p_cp, ifsm, p_np, p_dmb, \
         p_ep, p_fg, newStateNp, day, iteration, fsmId, \
         fakeMpiRankOpenMp, cout, outStream, \
         startWtime, endWtime, counter, \
         sinnertime, einnertime, \
         dt1time, dt2time, dt3time, \
         countChange) \
  default(none)
{
    // Any variable declared in here is private.
    tid = omp_get_thread_num();
    NodeInteractModel02IF *p_nim = 0;
    startWtime[tid] = omp_get_wtime();
    if (tid == 0) {
        gettimeofday(&sinnertime, 0);
    }

    #pragma omp for nowait
    for (itime = 0; itime < numkeys; ++itime) {
        ++(counter[tid]);

        // node is a tail, or owned, node.
        // This is step 1.
        gettimeofday(&b1time, 0);
        node = vec_keys.at(itime);
        gettimeofday(&e1time, 0);
        dt1time[tid] += (1000000 * (e1time.tv_sec - b1time.tv_sec) +
                         e1time.tv_usec - b1time.tv_usec);

        // This is step 2.
        gettimeofday(&b1time, 0);
        p_nim = p_dmb->getModel(node);
        gettimeofday(&e1time, 0);
        dt2time[tid] += (1000000 * (e1time.tv_sec - b1time.tv_sec) +
                         e1time.tv_usec - b1time.tv_usec);

        // This is step 3.
        gettimeofday(&b1time, 0);
        isStateUpdate = p_nim->computeNextState(lots of args);
        gettimeofday(&e1time, 0);
        dt3time[tid] += (1000000 * (e1time.tv_sec - b1time.tv_sec) +
                         e1time.tv_usec - b1time.tv_usec);

        if (isStateUpdate) {
            ++(countChange[tid]);
        }
    } // End for over vector of owned nodes.

    endWtime[tid] = omp_get_wtime();
    if (tid == 0) {
        gettimeofday(&einnertime, 0);
    }
} // End OMP parallel region.
Now, the problem. Take the 8-thread execution as a concrete example. One execution's results are shown below; these are typical. dt1 is the cumulative time, in seconds, to run step 1 above; dt2 is the cumulative time for step 2; dt3 is the cumulative time for step 3; cum = dt1 + dt2 + dt3. countChange is the number of "nodes" that change in step 3. There are 8 lines of data, one per thread (tid=0 is the first row, ..., tid=7 is the last). There are 800,000 "nodes" in this run, so at most there can be 8 x 100,000 = 800,000 countChanges in total. I have confirmed that each thread handles 100,000 of the 800,000 nodes. Hence the work, in terms of the number of nodes to process, is the same for each thread. However, as described below, the amount of computation is not the same per thread.
+++++++++++++++++++++++++++
     dt1        dt2        dt3        cum (s)    countChange
0.013292   0.041117   3.10149    3.1559              15
0.009705   0.041273   3.17969    3.23067             21
0.009907   0.040998   3.29188    3.34279             16
0.009905   0.040169   3.38807    3.43814             26
0.023467   0.039489   0.198228   0.261184        100000
0.023945   0.038114   0.187334   0.249393        100000
0.023648   0.042231   0.197294   0.263173        100000
0.021285   0.046682   0.219039   0.287006        100000
dt1 is less than dt2, as expected, and both are much less than dt3, as expected, because step 3 involves the real computation. Note, though, the problem with the dt3 values: they vary by more than an order of magnitude, and they fall into 2 groups, one with dt3 ~ 3.2 and one with dt3 ~ 0.2. Furthermore, the threads with the fastest execution are the ones that do the most work: each of the latter four threads changes all 100,000 of its nodes, while the first four threads change between 15 and 26 values, which is obviously orders of magnitude fewer. The latter 4 threads do more work because there are more computations when a node is changed. Furthermore, the machine on which I am running has two 4-core processors (two NUMA nodes with 4 cores each). I'd expect the master thread, tid=0, to have a lower time if anything, but it is in the group with the greater times. Also, the single-threaded code produces cum ~ 11.3 seconds, and 11.3/8 = 1.41 seconds.
Since the code executes this loop, and others like it, millions of times, the difference between the ideal time (1.41 s) and the largest measured time (3.44 s above) is huge and seems excessive.
Moreover, if one runs the above example with 4 threads, instead of 8, then the first two threads have the excessive times and the latter two threads have fast times. See the following 4-thread data:
+++++++++++++++++++++++++++
     dt1        dt2        dt3        cum (s)    countChange
0.023794   0.073054   5.41201    5.50886             36
0.018677   0.072956   5.77536    5.86699             42
0.027368   0.066898   0.341455   0.435721        200000
0.026892   0.076005   0.363742   0.466639        200000
Again, the disparity between the first two and last two threads is an order of magnitude in time (~5.5 versus ~0.4); and again, the fastest running threads do the most work.
Here is sample 2-thread data. The second thread does more work (it changes 400,000 nodes, while the first thread changes only 78), yet it runs an order of magnitude faster (0.8 s versus 10.8 s).
+++++++++++++++++++++++++++
     dt1        dt2        dt3        cum (s)    countChange
0.025298   0.144209   10.6269    10.7964             78
0.019307   0.126661   0.619432   0.7654          400000
I have repeated this experiment numerous times, both with OpenMP alone and with a combined OpenMP + MPI code, and I get the same result every single time (of course, the values are tweaked a bit, but the trends are the same): the first half of the threads (those with the smallest tids) run the longest and do the least work. Also, using gcc 4.7.0, and hence OpenMP 3.1, gives the same results.
I'd be very grateful for any help as to why these threads have such wide differences in execution time, and what I can do to eliminate the disparity.
If the computeNextState calculation is memory bound, and the data it needs is mostly local to one of the two nodes, then an order of magnitude in timing difference doesn't strike me as at all implausible. But as @Peter points out above, it is highly suspicious that one node's threads would always mark every item as changed, while those on the other only do the same for a total of 0.02% of them. – Jonathan Dursi