I have a kernel which I want to start with the configuration "1 block x 32 threads". To increase parallelism I want to start several streams instead of running a bigger "work package" than "1 block x 32 threads". I want to use the GPU in a program where data comes from the network. I don't want to wait until a bigger "work package" is available. The code is like:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
The real code is much more complex but I want to keep it simple (15 CPU threads use the GPU).
The code works but streams doesn't run concurrently as expected. The GTX 480 has 15 SMs where each SM has 32 shader processors. I expect that if I start the kernel 15 times, all 15 streams run in parallel, but this is not the case. I have used the Nvidia Visual Profiler and there is a maximum of 5 streams which run in parallel. Often only one stream runs. The performance is really bad.
I get the best results with a "64 block x 1024 threads" configuration. If I use instead a "32 block x 1024 threads" configuration but two streams the streams are executed one after each other and performance drops. I am using Cuda Toolkit 5.5 and Ubuntu 12.04.
Can somebody explain why this is the case and can give me some background information? Should it work better on newer GPUs? What is the best way to use the GPU in time critically applications where you don't want to buffer data? Probably this is not possible, but I am searching for techniques which bring me closer to a solution.
News:
I did some further research. The problem is the last cudaMemcpyAsync(..) (GPU->Host copy) call. If I remove it, all streams run concurrent. I think the problem is illustrated in http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf on slide 21. They say that on Fermi there are two copy queues but this is only true for tesla and quadro cards, right? I think the problem is that the GTX 480 has only one copy queue and all copy commands (host->GPU AND GPU->host) are put in this one queue. Everything is non-blocking and the GPU->host memcopy of the first thread blocks the host->GPU memcopy calls of other threads. Here some observations:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- sleep(10)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- cudaStreamSynchronize(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> doesn't work!!! Maybe cudaStreamSynchronize is put in the copy-queue?
Does someone knows a solution for this problem. Something like a blocking-kernel call would be cool. The last cudaMemcpyAsync() (GPU->device) should be called if the kernel has been finished.
Edit2: Here an example to clarify my problem: To keep it simple we have 2 streams:
Stream1:
------------
HostToGPU1
kernel1
GPUToHost1
Stream2:
------------
HostToGPU2
kernel2
GPUToHost2
The first stream is started. HostToGPU1 is executed, kernel1 is launched and GPUToHost1 is called. GPUToHost1 blocks because kernel1 is running. In the meantime Stream2 is started. HostToGPU2 is called, Cuda puts it in the queue but it can't be executed because GPUToHost1 blocks until kernel 1 has been finished. There are no data transfers in the moment. Cuda just waits for GPUToHost1. So my idea was to call GPUToHost1 when kernel1 is finished. This seams to be the reason why it works with sleep(..) because GPUToHost1 is called when the kernel has been finished. A kernel-launch which automatically blocks the CPU-thread would be cool. GPUToHost1 is not blocking in the queue (if there are no other data transfers at the time but in my case, data transfer are not time-consuming).
/usr/local/cuda/samples/6_Advanced/concurrentKernels
) Can you edit your question with the results of that test please? – Robert Crovella