6 votes

I have an application where I split the processing load among the GPUs on a user's system. Basically, there is a CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread.

Consider the following image (generated using NVIDIA's CUDA profiler tool) for an example of a GPU processing interval -- here the application is using a single GPU.

[Profiler timeline: single-GPU processing interval]

As you can see, a big portion of the GPU processing time is consumed by the two sorting operations, and I am using the Thrust library for this (thrust::sort_by_key). Also, it looks like thrust::sort_by_key calls a few cudaMallocs under the hood before it starts the actual sort.

Now consider the same processing interval where the application has spread the processing load over two GPUs:

[Profiler timeline: the same processing interval split across two GPUs]

In a perfect world you would expect the 2-GPU processing interval to be exactly half that of the single GPU (because each GPU is doing half the work). As you can see, this is not the case, partially because the cudaMalloc calls seem to take longer when they are made simultaneously (sometimes 2-3 times longer) due to some sort of contention issue. I don't see why this needs to be the case: the memory allocation spaces for the 2 GPUs are completely independent, so there shouldn't be a system-wide lock on cudaMalloc -- a per-GPU lock would be more reasonable.

To prove my hypothesis that the issue is with simultaneous cudaMalloc calls, I created a ridiculously simple program with two CPU threads (for each GPU) each calling cudaMalloc several times. I first ran this program so that the separate threads do not call cudaMalloc at the same time:

[Profiler timeline: cudaMalloc calls from the two threads not overlapping]

You see it takes ~175 microseconds per allocation. Next, I ran the program with the threads calling cudaMalloc simultaneously:

[Profiler timeline: cudaMalloc calls issued simultaneously from both threads]

Here, each call took ~538 microseconds, or 3 times longer than the previous case! Needless to say, this is slowing down my application tremendously, and it stands to reason that the issue would only get worse with more than 2 GPUs.
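For reference, a stripped-down sketch of that test is shown below. This is not my exact code; the allocation size and count are arbitrary, and a simple atomic flag is used to release both threads at roughly the same time (to reproduce the non-simultaneous case, join the first thread before launching the second).

    // One host thread per GPU, each issuing a burst of cudaMalloc calls.
    #include <cuda_runtime.h>
    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    static std::atomic<bool> go(false);

    void allocLoop(int device, int numAllocs, size_t bytes)
    {
        cudaSetDevice(device);              // bind this host thread to its GPU
        while (!go.load()) { }              // spin until released

        std::vector<void*> ptrs(numAllocs);
        for (int i = 0; i < numAllocs; ++i)
            cudaMalloc(&ptrs[i], bytes);    // the call whose latency is measured

        for (int i = 0; i < numAllocs; ++i)
            cudaFree(ptrs[i]);
    }

    int main()
    {
        const int    numAllocs = 20;                // arbitrary
        const size_t bytes     = 16 * 1024 * 1024;  // arbitrary (16 MB)

        std::thread t0(allocLoop, 0, numAllocs, bytes);
        std::thread t1(allocLoop, 1, numAllocs, bytes);

        go.store(true);                     // release both threads together
        t0.join();
        t1.join();

        printf("done\n");
        return 0;
    }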

I have noticed this behavior on Linux and Windows. On Linux, I am using Nvidia driver version 319.60 and on Windows I am using the 327.23 version. I am using CUDA toolkit 5.5.

Possible Reason: I am using a GTX 690 in these tests. This card is basically two GTX 680-like GPUs housed in the same unit. This is the only "multi-GPU" setup I've run, so perhaps the cudaMalloc issue has something to do with some hardware dependency between the 690's two GPUs?

The usual recommendation for high performance code is to get the malloc operations out of any performance loops. I realize this is not a trivial matter since you're using thrust. There are high performance sort libraries that can replace the thrust sort_by_key and that will allow you to do the allocations ahead of time and reuse them for the sort operations. CUB, b40c, and MGPU are all possibilities. – Robert Crovella
Yeah, I've looked into CUB and b40c (the b40c site says the project has been deprecated). Before I do the work to remove thrust, I would like to see some comparison graphs between the libraries. Could you point me to some performance numbers? Which library do you recommend? ... It seems that thrust isn't very high performance; for example, I already switched out a bunch of thrust::reduce and reduce_by_key calls with my own custom kernels -- doing this cut my processing time in half. No joke. – rmccabe3701
Thrust actually is based on a particular variant of b40c (or used to be). For equivalent test cases, there was not much difference in my testing between b40c and MGPU. In one test that I ran, I was only sorting on about 22 bits of a 32-bit value. MGPU had a dial I could turn to only sort on 22 bits, and I observed about a 40% speedup over thrust doing that. I have not used CUB much. If you poke through those links, you may find some performance data. For example, some MGPU perf data here. – Robert Crovella
In case it's not clear, my point in suggesting these alternate libraries is not that they have higher sort performance than thrust (although they might; I'm not sure what the results would be in your test case) but that they allow the option of unbundling the temporary data allocations that thrust is doing, so that you can handle these once, up front. – Robert Crovella

2 Answers

6 votes

I will preface this with a disclaimer: I'm not privy to the internals of the NVIDIA driver, so this is somewhat speculative.

The slowdown you are seeing is most likely just driver-level contention caused by competition between multiple threads calling device malloc simultaneously. Device memory allocation requires a number of OS system calls, as does driver-level context switching. There is a non-trivial amount of latency in both operations. It is probable that the extra time you see when two threads try to allocate memory simultaneously is caused by the additional driver latency of switching from one device to the other throughout the sequence of system calls required to allocate memory on both devices.

I can think of a few ways you should be able to mitigate this:

  • You could reduce the system call overhead of thrust memory allocation to zero by writing your own custom thrust memory allocator for the device that works off a slab of memory allocated during initialisation (see the sketch after this list). This would get rid of all of the system call overhead within each sort_by_key, but the effort of writing your own user memory manager is non-trivial. On the other hand, it leaves the rest of your thrust code intact.
  • You could switch to an alternative sort library and manage the allocation of temporary memory yourself. If you do all the allocation in an initialisation phase, the cost of the one-time memory allocations can be amortized to almost zero over the life of each thread.
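To illustrate the first option, here is a minimal bump allocator working off a one-time allocated slab (a sketch, not production code: there is no thread safety, no error checking, and the alignment choice is arbitrary). Note that attaching a custom allocator to a Thrust algorithm via thrust::cuda::par(alloc) requires a newer Thrust than the one bundled with CUDA 5.5, if I recall correctly, so treat the usage comment as indicative rather than definitive:

    // Option 1 sketch: serve Thrust's temporary allocations from a slab of
    // device memory obtained once, so no cudaMalloc happens inside sort_by_key.
    #include <thrust/sort.h>
    #include <thrust/device_vector.h>
    #include <thrust/system/cuda/execution_policy.h>
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cassert>

    struct slab_allocator
    {
        typedef char value_type;

        char  *pool;       // slab cudaMalloc'd once
        size_t capacity;   // slab size in bytes
        size_t offset;     // bump pointer

        explicit slab_allocator(size_t bytes) : capacity(bytes), offset(0)
        {
            cudaMalloc((void**)&pool, capacity);
        }
        ~slab_allocator() { cudaFree(pool); }

        char *allocate(std::ptrdiff_t num_bytes)
        {
            num_bytes = (num_bytes + 255) / 256 * 256;   // keep 256-byte alignment
            assert(offset + num_bytes <= capacity);
            char *p = pool + offset;
            offset += static_cast<size_t>(num_bytes);
            return p;                                    // no system call involved
        }

        void deallocate(char *, size_t) { }              // recycle by resetting offset
    };

    // Indicative usage, assuming device_vectors d_keys and d_vals:
    //   slab_allocator alloc(256 * 1024 * 1024);        // once, at start-up
    //   thrust::sort_by_key(thrust::cuda::par(alloc),
    //                       d_keys.begin(), d_keys.end(), d_vals.begin());
    //   alloc.offset = 0;                               // recycle for the next interval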

In multi-GPU CUBLAS-based linear algebra codes I have written, I combined both ideas and wrote a standalone user-space device memory manager which works off a one-time allocated device memory pool. I found that removing all of the overhead cost of intermediate device memory allocations yielded a useful speed up. Your use case might benefit from a similar strategy.

4 votes

To summarize the problem and give a possible solution:

The cudaMalloc contention probably stems from driver-level contention (possibly due to the need to switch device contexts, as talonmies suggested), and one can avoid this extra latency in performance-critical sections by cudaMalloc-ing temporary buffers beforehand.

It looks like I probably need to refactor my code so that I am not calling any sorting routine that calls cudaMalloc under the hood (in my case thrust::sort_by_key). The CUB library looks promising in this regard. As a bonus, CUB also exposes a CUDA stream parameter to the user, which could also serve to boost performance.

See CUB (CUDA UnBound) equivalent of thrust::gather for some details on moving from thrust to CUB.

UPDATE:

I backed out the calls to thrust::sort_by_key in favor of cub::DeviceRadixSort::SortPairs. Doing this shaved milliseconds off my per-interval processing time. Also, the multi-GPU contention issue has resolved itself: splitting the work across 2 GPUs now cuts the processing time by almost 50%, as expected.
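For reference, the replacement call looks roughly like this (the buffer names and key/value types below are placeholders, not my actual data structures). The point is that the temporary storage is sized and cudaMalloc'd once up front, and the sort itself runs on a per-GPU stream:

    // cub::DeviceRadixSort::SortPairs in place of thrust::sort_by_key.
    // The sizing call (NULL temp pointer) only computes the required scratch
    // size, so the single cudaMalloc can be done during initialization and
    // reused every processing interval.
    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    // Called once at start-up: size and allocate the scratch space.
    void initSortScratch(unsigned int *d_keys_in, unsigned int *d_keys_out,
                         int *d_vals_in, int *d_vals_out, int num_items,
                         void **d_temp_storage, size_t *temp_storage_bytes)
    {
        *temp_storage_bytes = 0;
        cub::DeviceRadixSort::SortPairs(NULL, *temp_storage_bytes,
                                        d_keys_in, d_keys_out,
                                        d_vals_in, d_vals_out, num_items);
        cudaMalloc(d_temp_storage, *temp_storage_bytes);
    }

    // Called every interval: no allocation, and the work is issued on the
    // stream belonging to this GPU's worker thread.
    void sortPairsOnStream(unsigned int *d_keys_in, unsigned int *d_keys_out,
                           int *d_vals_in, int *d_vals_out, int num_items,
                           void *d_temp_storage, size_t temp_storage_bytes,
                           cudaStream_t stream)
    {
        cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                        d_keys_in, d_keys_out,
                                        d_vals_in, d_vals_out, num_items,
                                        0, sizeof(unsigned int) * 8,   // sort all key bits
                                        stream);
    }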