20
votes

Firstly, I know this [type of] question is frequently asked, so let me preface this by saying I've read as much as I can, and I still don't know what the deal is.

I've parallelized a massive outer for loop. Number of loop iterations varies, typically between 20-150, but the loop body does a huge amount of work, calling on a lot of local intensive linear algebra routines (as in, the code is part of the source and not an external dependency). Within the loop body there is 1000+ calls to these routines, but they're all totally independent of one another, so I figured it would be a prime candidate for parallelism. The loop code is C++, but it calls a lot of subroutines written in C.

Code looks like this;

<declare and initialize shared variables here>
#ifdef _OPENMP
#pragma omp parallel for                            \
  private(....)\
  shared(....)              \
  firstprivate(....) schedule(runtime)
#endif
  for(tst = 0; tst < ntest; tst++) {

     // Lots of functionality (science!)
     // Calls to other deep functions which manipulate private variables only
     // Call to function which has 1000 loop iterations doing matrix manipulation
     // With no exaggeration, there are probably millions 
     // of for-loop iterations in this body, in the various functions called. 
     // They also do lots of mallocing and freeing
     // Finally generated some calculated_values

     shared_array1[tst] = calculated_value1;
     shared_array2[tst] = calculated_value2;
     shared_array3[tst] = calculated_value3;

 } // end of parallel and for

// final tidy up

There shouldn't, I believe, be any synchronization at all - the only time the threads access a shared variable are the shared_arrays, and they access unique points in those arrays, indexed by tst.

Thing is, when I up the number of threads (on a multicore cluster!) The speeds we're seeing (where we invoke this loop 5 times) are as follows;

              Elapsed time   System time
 Serial:        188.149          1.031
 2 thrds:       148.542          6.788
 4 thrds:       309.586        424.037       # SAY WHAT?
 8 thrds:       230.290        568.166  
16 thrds:       219.133        799.780 

Things which may be noticeable are the massive jump in System time between 2 and 4 threads, and the fact the elapsed time doubles as we move from 2 to 4, and then slowly decreases.

I've tried with a huge range of OMP_SCHEDULE parameters but no luck. Is this related to the fact each thread is using malloc/new and free/delete a lot? This has consistently been run with 8GBs memory - but I'm guessing that's not a problem. Frankly, the huge rise in system time makes it look like the threads might be blocking, but I have no idea why that would happen.

UPDATE 1 I really thought the false sharing was going to be the problem, so re-wrote the code so that the loops store their calculated values in thread-local arrays, and then copy these arrays across to the shared array at the end. Sadly this didn't have any impact, though I almost don't believe it myself.

Following @cmeerw's advice, I ran strace -f, and after all the initialization there are just millions of lines of

[pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58065] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 57684] <... futex resumed> )       = 0
[pid 58067] <... futex resumed> )       = 0
[pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58066] <... futex resumed> )       = 0
[pid 57684] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58065] <... futex resumed> )       = 0
[pid 58067] <... futex resumed> )       = 0
[pid 57684] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58065] <... futex resumed> )       = 0
[pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 57684] <... futex resumed> )       = 0
[pid 58067] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58066] <... futex resumed> )       = 0
[pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58067] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58065] <... futex resumed> )       = 0
[pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 57684] <... futex resumed> )       = 0
[pid 58067] <... futex resumed> )       = 0
[pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58065] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58066] <... futex resumed> )       = 0
[pid 58065] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58067] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 57684] <... futex resumed> )       = 0

Anyone have any ideas what means? Looks like the threads are context switching way too often, or just blocking and unblocking? When I strace the same implementation with OMP_NUM_THREADS set to 0 I get none of this at all. For some comparison, the logfile generated when 1 thread is used is 486 KB , and the log file generated when 4 threads are used is 266 MB.

In other words, the parallel version invokes an extra 4170104 lines of log file...

UPDATE 2

As suggested by Tom, I tried binding threads to specific processors to no avail. We're in OpenMP 3.1, so I set the environment variable using export OMP_PROC_BIND=true. Same size logfile and same timeframe.

UPDATE 3

The plot thickens. Having only profiled on the cluster so far, I installed GNU GCC 4.7 via Macports and compiled (with openMP) on my Macbook for the first time (Apple's GCC-4.2.1 throws up a compiler bug when OpenMP is enabled, which is why I hadn't compiled and run it in parallel locally until now). On the Macbook, you see basically the trend you'd expect

                C-code time
 Serial:         ~34 seconds
 2 thrds:        ~21 seconds
 4 thrds:        ~14 seconds
 8 thrds:        ~12 seconds
16 thrds:         ~9 seconds

We see dimishing returns towards the ends, though this is hardly surprising as a couple of the data sets we're iterating over on this test data have <16 members (so, we're generating 16 threads for, say a for-loop with 7 iterations).

So, now the question remains - WHY does the cluster's performance degrade so badly. I'm going to try on a different quadcore linuxbox tonight. The cluster compiles with GNU-GCC 4.6.3, but I can't believe that in itself is going to make such a difference?

Neither ltrace nor GDB are installed on the cluster (and I can't get them on for various reasons). If my linuxbox gives cluster-like performance I'll run the corresponding ltrace analysis there.

UPDATE 4

Oh my. I duel booted my Macbook Pro into Ubuntu (12.04) and re-ran the code. It all runs (which is somewhat reassuring) but I see the same, weird bad-performance behavior I see on the clusters, and the same run of millions of futex calls. Given the only difference between my local machine in Ubuntu and in OSX is software (and I'm using the same compiler and libraries - presumably there aren't different glibc implementations for OSX and Ubuntu!) I'm now wondering if this is something to do with how Linux schedules/distributes threads. Any case, being on my local machine makes everything a million times easier, so I'm going to go ahead and ltrace -f it and see what I can find. I wrote a work around for the clusters which forks() off a separate process, and gives a perfect 1/2 in the runtime, so it's definitely possible to get the parallelism going...

3
Are you doing any random number generation in those functions? I had a similar problem: glibc does synchronization on each call to rand. Solution for OpenMP: stackoverflow.com/questions/8980056/…Alexander Kondratskiy
Have you tried running under "strace -f" yet? That should at least show you which system calls are involved...cmeerw
Could you see what happens if you explicitly set thread affinity masks?toniedzwiedz
So the strace output means that there is heavy lock contention. The next thing to do is either look at the output of running via "ltrace -f" or use gdb and look at the callstack of each thread at random points (which should provide some clue of what is responsible for this excessive locking).cmeerw
What is your cluster hardware configuration? What CPUs does it have - is it by any chance a multi-socket AMD64 or Nehalem or newer Intel system? Synchronisation between threads on different CPUs and/or NUMA nodes is much more expensive than for threads running on cores with shared cache. You might be also hitting some other bottleneck, e.g. memory bandwidth limit ("... calling on a lot of local intensive linear algebra routines ..." hints on it)Hristo Iliev

3 Answers

8
votes

So after some fairly extensive profiling (thanks to this great post for info on gprof and time sampling with gdb) which involved writing a big wrapper function to generate production level code for profiling, it became obvious that for the vast majority of the time when I aborted the running code with gdb and ran backtrace the stack was in an STL <vector> call, manipulating a vector in some way.

The code passes a few vectors into the parallel section as private variables, which seemed to work fine. However, after pulling out all the vectors and replacing them with arrays (and some other jiggery-pokery to make that work) I saw a significant speed up. With small, artificial data sets the speed up is near perfect (i.e. as you double number of threads you half the time), while with real data sets the speed up isn't quite as good, but this makes complete sense as in the context of how the code works.

It seems that for whatever reason (maybe some static or global variables deep in the STL<vector> implementation?) when there are loops moving through hundreds of thousands of iterations in parallel there is some deep level locking, which occurs in Linux (Ubuntu 12.01 and CentOS 6.2) but not in OSX.

I'm really intrigued as to why I see this difference. Could it be difference in how the STL is implemented (OSX version was compiled under GNU GCC 4.7, as were the Linux ones), or is this to do with context switching (as suggested by Arne Babenhauserheide)

In summary, my debugging process was as followed;

  • Initial profiling from within R to identify the issue

  • Ensured there were no static variables acting as shared variables

  • Profiled with strace -f and ltrace -f which was really helpful in identifying locking as the culprit

  • Profiled with valgrind to look for any errors

  • Tried a variety of combinations for the schedule type (auto, guided, static, dynamic) and chunk size.

  • Tried binding threads to specific processors

  • Avoided false sharing by creating thread-local buffers for values, and then implement a single synchronization event at the end of the for-loop

  • Removed all the mallocing and freeing from within the parallel region - didn't help with the issue but did provide a small general speedup

  • Tried on various architectures and OSses - didn't really help in the end, but did show that this was a Linux vs. OSX issue and not a supercomputer vs. desktop one

  • Building a version which implements concurrency using a fork() call - having the workload between two processes. This halved the time on both OSX and Linux, which was good

  • Built a data simulator to replicate production data loads

  • gprof profiling

  • gdb time sampling profiling (abort and backtrace)

  • Comment out vector operations

  • Had this not worked, Arne Babenhauserheide's link looks like it may well have some crucial stuff on memory fragmentation issues with OpenMP

4
votes

It's hard to know for sure what is happening without significant profiling, but the performance curve seems indicative of False Sharing...

threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time

Great article on the topic at Dr Dobbs

http://www.drdobbs.com/go-parallel/article/217500206?pgno=1

In particular the fact that the routines are doing a lot of malloc/free could lead to this.

One solution is to use a pool based memory allocator rather than the default allocator so that each thread tends to allocate memory from a different physical address range.

2
votes

Since the threads actually don’t interact, you could just change the code to multiprocessing. You would only have message passing in the end and it would be guaranteed that the threads don’t need to synchronize anything.

Here’s python3.2-code which basically does that (you’ll likely want to not do it in python for performance reasons - or put the for-loop into a C-function and bind that via cython. You’ll see from the code why I show it in Python anyway):

from concurrent import futures
from my_cython_module import huge_function
parameters = range(ntest)
with futures.ProcessPoolExecutor(4) as e:
    results = e.map(huge_function, parameters)
    shared_array = list(results)

That’s it. Increase the number of processes to the number of jobs you can put into the cluster and let each process just submit and monitor a job to scale to any number of calls.

Huge functions without interaction and small input values almost call out for multiprocessing. And as soon as you have that, switching up to MPI (with almost unlimited scaling) is not too hard.

From the technical side, AFAIK context switches in Linux are quite expensive (monolithic kernel with much kernel-space memory), while they are much cheaper on OSX or the Hurd (Mach microkernel). That might explain the huge amount of system time you see on Linux but not on OSX.