I have an application where I split the processing load among the GPUs on a user's system. Basically, there is one CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread.
Consider the following image (generated using NVIDIA's CUDA profiler tool) for an example of a GPU processing interval -- here the application is using a single GPU.
As you can see, a big portion of the GPU processing time is consumed by the two sorting operations, for which I am using the Thrust library (thrust::sort_by_key). Also, it looks like thrust::sort_by_key makes a few cudaMalloc calls under the hood before it starts the actual sort.
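For context, the sort calls follow the standard Thrust pattern, roughly like the sketch below (the key/value types and sizes here are illustrative, not the ones from my application). The temporary storage that thrust::sort_by_key allocates internally is where the cudaMalloc calls show up in the profiler timeline:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
    // Illustrative sizes/types; the real app uses different data.
    const int n = 1 << 20;
    thrust::device_vector<int>   keys(n);
    thrust::device_vector<float> vals(n);
    // ... fill keys and vals on the device ...

    // Sorts vals according to keys. Internally, Thrust allocates
    // temporary device storage with cudaMalloc before sorting.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    return 0;
}
```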
Now consider the same processing interval where the application has spread the processing load over two GPUs:
In a perfect world you would expect the 2 GPU processing interval to be exactly half that of the single GPU (because each GPU is doing half the work). As you can see, this is not the case, partially because the cudaMalloc calls seem to take longer when they are made simultaneously (sometimes 2-3 times longer), apparently due to some sort of contention issue. I don't see why this needs to be the case: the memory allocation spaces for the 2 GPUs are completely independent, so there shouldn't be a system-wide lock on cudaMalloc -- a per-GPU lock would be more reasonable.
To prove my hypothesis that the issue is with simultaneous cudaMalloc calls, I created a ridiculously simple program with two CPU threads (one per GPU), each calling cudaMalloc several times. I first ran this program so that the separate threads do not call cudaMalloc at the same time:
You see it takes ~175 microseconds per allocation. Next, I ran the program with the threads calling cudaMalloc simultaneously:
Here, each call took ~538 microseconds, or roughly 3 times longer than in the previous case! Needless to say, this is slowing down my application tremendously, and it stands to reason the issue would only get worse with more than 2 GPUs.
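Since the test program itself isn't shown above, here is a minimal sketch of the kind of reproducer I mean (the allocation count and size are arbitrary choices, and this assumes a system with at least 2 CUDA devices):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Time a series of cudaMalloc calls on one device and report
// the average time per allocation.
void allocLoop(int device, int count, size_t bytes) {
    cudaSetDevice(device);
    std::vector<void*> ptrs(count);
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < count; ++i)
        cudaMalloc(&ptrs[i], bytes);
    auto t1 = std::chrono::high_resolution_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("GPU %d: %.1f us per cudaMalloc\n", device, us / count);
    for (void* p : ptrs) cudaFree(p);
}

int main() {
    const int    kAllocs = 20;       // arbitrary
    const size_t kBytes  = 4 << 20;  // 4 MB per allocation, arbitrary

    // Serialized case: the threads' cudaMalloc calls never overlap.
    allocLoop(0, kAllocs, kBytes);
    allocLoop(1, kAllocs, kBytes);

    // Simultaneous case: both threads call cudaMalloc concurrently.
    std::thread a(allocLoop, 0, kAllocs, kBytes);
    std::thread b(allocLoop, 1, kAllocs, kBytes);
    a.join();
    b.join();
    return 0;
}
```

On my machine, the per-allocation time reported in the simultaneous case is roughly 3x the serialized case, as described above.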
I have noticed this behavior on both Linux and Windows. On Linux I am using NVIDIA driver version 319.60, and on Windows I am using version 327.23. I am using CUDA toolkit 5.5.
Possible Reason: I am using a GTX 690 in these tests. This card is basically two 680-like GPUs housed in the same unit. This is the only "multi-GPU" setup I've run, so perhaps the cudaMalloc issue has something to do with some hardware dependence between the 690's two GPUs?