I am having a problem with OpenMP threading: the threads in a parallel section take very different times to execute. I am running on a Linux cluster. The code is pure OpenMP, i.e., no MPI is mixed in, and it is written in C++. I am using gcc 4.5.3, and therefore OpenMP 3.0, at optimization level 2, i.e., -O2. I give the code first, then the data generated from it. Since I want to loop over the keys of a map, I first copy the keys into a vector, vec_keys, and then I run the parallel for loop over the elements of vec_keys. There is a for loop that is parallelized with OMP. There are 800,000 "nodes" or entries that must be processed, and there are 3 "commands" within the parallel for loop:
1. Get the map key. See the line: node = vec_keys.at(itime);
2. Using the key from step 1, get a pointer to a C++ object that is going to have a method invoked on it. See the line: p_nim = p_dmb->getModel(node);
3. Invoke the method on the object retrieved from the map in step 2. See the statement:
   isStateUpdate = p_nim->computeNextState(day, iteration, fsmId, p_dmb, tid, node, p_np, p_ep, p_cp, p_fg, newStateNp, outStream);
Note that in step 2, map entries are retrieved but not written. In step 3, the map's contents are changed, but only through indirection: the map keys do not change, and the values (in the map entries) are pointers to primitive arrays that were instantiated on the heap. Thus, without changing the value pointers, I can change the contents of the primitive arrays. The point is that I am working with maps, each key is visited exactly once in the OMP for loop, and there are no race conditions or inconsistent memory accesses. I have run with 1, 2, 4, and 8 threads many times, and the output is always correct. The operations in steps 1 and 2 above are the same for every map key; only step 3 can differ. The code is:
#pragma omp parallel num_threads(numSpecOmpThreads) \
  private(itime, node, index, modelId, isStateUpdate, tid, \
          b1time, e1time) \
  shared(numkeys, vec_keys, p_graph, p_cp, ifsm, p_np, p_dmb, \
         p_ep, p_fg, newStateNp, day, iteration, fsmId, \
         fakeMpiRankOpenMp, cout, outStream, \
         startWtime, endWtime, counter, \
         sinnertime, einnertime, \
         dt1time, dt2time, dt3time, \
         countChange) \
  default(none)
{
    // Any variable declared in here is private.
    tid = omp_get_thread_num();
    NodeInteractModel02IF *p_nim = 0;
    startWtime[tid] = omp_get_wtime();
    if (tid == 0) {
        gettimeofday(&sinnertime, 0);
    }

    #pragma omp for nowait
    for (itime = 0; itime < numkeys; ++itime) {
        ++(counter[tid]);

        // node is a tail, or owned, node.
        // This is step 1.
        gettimeofday(&b1time, 0);
        node = vec_keys.at(itime);
        gettimeofday(&e1time, 0);
        dt1time[tid] += (1000000 * (e1time.tv_sec - b1time.tv_sec) +
                         e1time.tv_usec - b1time.tv_usec);

        // This is step 2.
        gettimeofday(&b1time, 0);
        p_nim = p_dmb->getModel(node);
        gettimeofday(&e1time, 0);
        dt2time[tid] += (1000000 * (e1time.tv_sec - b1time.tv_sec) +
                         e1time.tv_usec - b1time.tv_usec);

        // This is step 3.
        gettimeofday(&b1time, 0);
        isStateUpdate = p_nim->computeNextState(lots of args);
        gettimeofday(&e1time, 0);
        dt3time[tid] += (1000000 * (e1time.tv_sec - b1time.tv_sec) +
                         e1time.tv_usec - b1time.tv_usec);

        if (isStateUpdate) {
            ++(countChange[tid]);
        }
    } // End for over vector of owned nodes.

    endWtime[tid] = omp_get_wtime();
    if (tid == 0) {
        gettimeofday(&einnertime, 0);
    }
} // End OMP parallel region.
Now, the problem. Take the 8-thread execution as a concrete example. One execution's results are shown below; these are typical. dt1 is the cumulative time, in seconds, to run step 1 above; dt2 is the cumulative time for step 2; dt3 is the cumulative time for step 3; cum = dt1 + dt2 + dt3. countChange is the number of "nodes" that change in step 3. There are 8 lines of data, one per thread (tid=0 is the first row, ..., tid=7 is the last). There are 800,000 "nodes" in this run, so at most there can be 8 x 100,000 = 800,000 countChanges in total. I have confirmed that each thread handles 100,000 of the 800,000 nodes. Hence the work, in terms of the number of nodes to process, is the same for each thread. However, as described below, the amount of computation is not the same per thread.
+++++++++++++++++++++++++++
     dt1        dt2        dt3        cum (s)    countChange
0.013292   0.041117   3.10149    3.1559              15
0.009705   0.041273   3.17969    3.23067             21
0.009907   0.040998   3.29188    3.34279             16
0.009905   0.040169   3.38807    3.43814             26
0.023467   0.039489   0.198228   0.261184        100000
0.023945   0.038114   0.187334   0.249393        100000
0.023648   0.042231   0.197294   0.263173        100000
0.021285   0.046682   0.219039   0.287006        100000
dt1 is less than dt2, as expected, and both are much less than dt3, as expected, because step 3 involves the real computation. Note, though, the problem with the dt3 values: they vary by more than an order of magnitude, and they fall into 2 groups, one with dt3 ~ 3.2 and one with dt3 ~ 0.2. Furthermore, the threads with the fastest execution are the ones that do the most work: each of the latter four threads changes all 100,000 of its nodes, while the first four threads change between 15 and 26 values, which is obviously orders of magnitude fewer. The latter 4 threads do more work because there are more computations when a node is changed. Furthermore, the machine on which I am running has two 4-core processors (two NUMA nodes with 4 cores each). I'd expect the master thread, tid=0, to have a lower time if anything, but it is in the group with the greater times. Also, the single-threaded code produces cum ~ 11.3 seconds, and 11.3/8 = 1.41 seconds.
Since the code executes this loop, and others like it, millions of times, the difference between the ideal time (1.41 s) and the largest measured time (3.44 s above) is huge and seems excessive.
Moreover, if one runs the above example with 4 threads, instead of 8, then the first two threads have the excessive times and the latter two threads have fast times. See the following 4-thread data:
+++++++++++++++++++++++++++
     dt1        dt2        dt3        cum (s)    countChange
0.023794   0.073054   5.41201    5.50886             36
0.018677   0.072956   5.77536    5.86699             42
0.027368   0.066898   0.341455   0.435721        200000
0.026892   0.076005   0.363742   0.466639        200000
Again, the disparity between the first two and last two threads is an order of magnitude in time (~5.5 versus ~0.4); and again, the fastest running threads do the most work.
Here is sample 2-thread data. The second thread does more work (it changes 400,000 nodes, while the first thread changes only 78), yet it runs an order of magnitude faster (0.8 s versus 10.8 s).
+++++++++++++++++++++++++++
     dt1        dt2        dt3        cum (s)    countChange
0.025298   0.144209   10.6269    10.7964             78
0.019307   0.126661   0.619432   0.7654          400000
I have repeated this experiment numerous times, both with OpenMP alone and with a combined OpenMP + MPI code, and I get the same result every single time (of course, the values are tweaked a bit, but the trends are the same): the first half of the threads (those with the smallest tids) run the longest and do the least work. Also, using gcc 4.7.0, and hence OpenMP 3.1, gives the same results.
I'd be very grateful for any help as to why these threads have such wide differences in execution time, and what I can do to eliminate the disparity.
If the computeNextState calculation is memory bound, and the data it needs is mostly local to one of the two nodes, then an order of magnitude in timing difference doesn't strike me as at all implausible. But as @Peter points out above, it is highly suspicious that one node's threads would always mark every item as changed, while those on the other only do the same for a total of 0.02% of them. – Jonathan Dursi