
I have a kernel that I want to launch with the configuration "1 block x 32 threads". To increase parallelism, I want to use several streams rather than running a bigger "work package" than "1 block x 32 threads". I use the GPU in a program where data arrives from the network, and I don't want to wait until a bigger "work package" is available. The code looks like this:

Thread(i=0..14) {
  - copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
  - run kernel(stream i)
  - copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}

The real code is much more complex, but I want to keep it simple here (15 CPU threads use the GPU).
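For reference, here is a stripped-down CUDA sketch of what each CPU thread does. The kernel name processPacket, the size N, and the "work" it performs are placeholders, and error checking is omitted; pinned host memory is used because cudaMemcpyAsync needs it to be truly asynchronous:

#include <cuda_runtime.h>

// Placeholder kernel; a grid-stride loop lets a "1 block x 32 threads"
// launch still cover all n elements.
__global__ void processPacket(const float *in, float *out, int n) {
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
         idx += blockDim.x * gridDim.x)
        out[idx] = in[idx] * 2.0f;
}

void workerThread(cudaStream_t stream, int N) {
    float *h_in, *h_out, *d_in, *d_out;
    // allocation shown inline for brevity; real code allocates once up front
    cudaHostAlloc((void **)&h_in,  N * sizeof(float), cudaHostAllocDefault); // pinned
    cudaHostAlloc((void **)&h_out, N * sizeof(float), cudaHostAllocDefault); // pinned
    cudaMalloc((void **)&d_in,  N * sizeof(float));
    cudaMalloc((void **)&d_out, N * sizeof(float));

    // ... fill h_in with data from the network ...

    cudaMemcpyAsync(d_in, h_in, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);       // copy Host -> GPU
    processPacket<<<1, 32, 0, stream>>>(d_in, d_out, N);   // run kernel
    cudaMemcpyAsync(h_out, d_out, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);       // copy GPU -> Host
    cudaStreamSynchronize(stream);                         // wait before using h_out
}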

The code works, but the streams don't run concurrently as expected. The GTX 480 has 15 SMs, each with 32 shader processors. I expected that if I start the kernel 15 times, all 15 streams would run in parallel, but this is not the case. Using the NVIDIA Visual Profiler, I see a maximum of 5 streams running in parallel; often only one stream runs. The performance is really bad.

I get the best results with a "64 blocks x 1024 threads" configuration. If I instead use a "32 blocks x 1024 threads" configuration with two streams, the streams are executed one after the other and performance drops. I am using CUDA Toolkit 5.5 and Ubuntu 12.04.

Can somebody explain why this is the case and give me some background information? Does it work better on newer GPUs? What is the best way to use the GPU in time-critical applications where you don't want to buffer data? Probably this is not possible, but I am searching for techniques that bring me closer to a solution.

Update:

I did some further research. The problem is the last cudaMemcpyAsync(..) (GPU->Host) call. If I remove it, all streams run concurrently. I think the problem is illustrated on slide 21 of http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf. It says that on Fermi there are two copy queues, but that is only true for Tesla and Quadro cards, right? I think the problem is that the GTX 480 has only one copy queue and all copy commands (Host->GPU AND GPU->Host) are put into this one queue. Everything is non-blocking, but the GPU->Host memcopy of the first thread blocks the Host->GPU memcopy calls of the other threads. Here are some observations:

Thread(i=0..14) {
  - copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
  - run kernel(stream i)
}

-> works: streams run concurrently

Thread(i=0..14) {
  - copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
  - run kernel(stream i)
  - sleep(10)
  - copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}

-> works: streams run concurrently

Thread(i=0..14) {
  - copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
  - run kernel(stream i)
  - cudaStreamSynchronize(stream i)
  - copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}

-> doesn't work!!! Maybe cudaStreamSynchronize is put in the copy queue?

Does someone know a solution for this problem? Something like a blocking kernel launch would be cool: the last cudaMemcpyAsync() (GPU->Host) should only be issued once the kernel has finished.
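One way to emulate such a blocking launch might be with events (an untested sketch; d_in, h_out, bytes, etc. are placeholder names): record an event right after the kernel and block the CPU thread on it before issuing the GPU->Host copy:

cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
kernel<<<1, 32, 0, stream>>>(d_in, d_out);
cudaEventRecord(done, stream);
cudaEventSynchronize(done);   // CPU thread blocks until the kernel has finished
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

This would behave like the sleep(10) variant above, but wait exactly as long as the kernel needs.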

Edit 2: Here is an example to clarify my problem. To keep it simple, we have two streams:

Stream1:
------------
HostToGPU1
kernel1
GPUToHost1

Stream2:
------------
HostToGPU2
kernel2
GPUToHost2

The first stream is started: HostToGPU1 is executed, kernel1 is launched, and GPUToHost1 is called. GPUToHost1 blocks because kernel1 is still running. In the meantime Stream2 is started: HostToGPU2 is called and CUDA puts it into the queue, but it can't be executed because GPUToHost1 blocks until kernel1 has finished. No data is being transferred at that moment; CUDA just waits for GPUToHost1. So my idea was to call GPUToHost1 only when kernel1 is finished. This seems to be why it works with sleep(..): GPUToHost1 is called after the kernel has finished. A kernel launch that automatically blocks the CPU thread would be cool. Then GPUToHost1 would not block in the queue (in my case the data transfers are not time-consuming anyway).
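In code, this issue order looks like the following (a sketch with placeholder names):

// Depth-first issue order: all commands for one stream are issued
// before any command for the next stream.
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(d_in[i], h_in[i], bytes,
                    cudaMemcpyHostToDevice, stream[i]);        // HostToGPU(i)
    kernel<<<1, 32, 0, stream[i]>>>(d_in[i], d_out[i]);        // kernel(i)
    cudaMemcpyAsync(h_out[i], d_out[i], bytes,
                    cudaMemcpyDeviceToHost, stream[i]);        // GPUToHost(i)
}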

Is this on Windows or Linux? – Robert Crovella
I am using Ubuntu 12.04. – user4811
What kind of results do you get if you run the CUDA concurrent kernels sample? (It should already be available on your system at /usr/local/cuda/samples/6_Advanced/concurrentKernels.) Can you edit your question with the results of that test, please? – Robert Crovella
Thanks, this was a good hint! I have figured out that the problem is the cudaMemcpyAsync(..) command. If I do all Host->GPU and GPU->Host copies before and after starting the kernels, it works: all kernels run in parallel. Hmm, am I using the wrong copy command? – user4811
Are you pinning the memory? – Robert Crovella

1 Answer


Concurrent kernel execution is most easily witnessed on Linux.

For a good example and an easy test, refer to the concurrent kernels sample.

Good concurrency among kernels generally requires several things:

  • a device which supports concurrent kernels, so a cc 2.0 or newer device
  • kernels that are small enough, in terms of number of blocks and other resource usage (registers, shared memory), that multiple kernels can actually execute at once. Kernels with larger resource requirements will typically be observed to run serially; this is expected behavior.
  • proper usage of streams to enable concurrency

In addition, concurrent kernel execution often implies copy/compute overlap. For copy/compute overlap to work, you must:

  • be using a GPU with enough copy engines. Some GPUs have one engine, some have two. If your GPU has one engine, you can overlap one copy operation (i.e. one direction) with kernel execution. If you have two copy engines (your GeForce GPU has one), you can overlap both directions of copying with kernel execution. (You can query this; see the sketch after this list.)
  • use pinned (host) memory for any data that will be copied to or from GPU global memory, i.e. for the source or destination of any copy operation you intend to overlap
  • use streams properly, with the necessary async versions of the relevant API calls (e.g. cudaMemcpyAsync)
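For reference, you can check how many copy engines your device has by querying its properties (a minimal sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // asyncEngineCount is 1 if one copy direction can overlap kernel
    // execution, 2 if both directions can overlap it
    printf("%s: %d copy engine(s)\n", prop.name, prop.asyncEngineCount);
    return 0;
}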

Regarding your observation that the smaller 32-block x 1024-thread kernels do not execute concurrently, this is likely a resource issue (blocks, registers, shared memory) preventing much overlap. For example, a cc 2.0 device allows at most 1536 resident threads per SM, so only one 1024-thread block fits on an SM at a time, and 32 such blocks already occupy all 15 SMs of your GTX 480 for more than two full waves. If you have enough blocks in the first kernel to occupy the GPU execution resources, it's not sensible to expect additional kernels to begin executing until the first kernel is finished or mostly finished.

EDIT: Responding to question edits and additional comments below.

Yes, the GTX 480 has only one copy "queue" (I mentioned this explicitly in my answer, but I called it a copy "engine"). Only one cudaMemcpy... operation can run at any given time, therefore only one direction (H2D or D2H) can actually be moving data at any given time, and you will only see one cudaMemcpy... operation overlap with any given kernel. And cudaStreamSynchronize causes the stream to wait until ALL CUDA operations previously issued to that stream are completed.

Note that I don't think the cudaStreamSynchronize you have in your last example should be necessary. Streams have two execution characteristics:

  1. CUDA operations (API calls, kernel calls, everything) issued to the same stream will always execute sequentially, regardless of your use of the async APIs or any other considerations.
  2. CUDA operations issued to separate streams, assuming all the necessary requirements have been met, will execute asynchronously to each other.

Due to item 1, in your last case your final "copy data GPU->Host" operation will not begin until the preceding kernel call issued to that stream is complete, even without the cudaStreamSynchronize call. So I think you can get rid of that call, i.e. the second case you have listed should be no different than the final case, and in the second case you should not need the sleep operation either. A cudaMemcpy... issued to a stream will not begin until all previous CUDA activity in that stream is finished. This is a characteristic of streams.
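In other words, per stream, something like this should be sufficient (a sketch with placeholder names):

// Stream ordering already serializes these three operations, so no
// cudaStreamSynchronize or sleep is needed between them.
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);   // H2D
kernel<<<1, 32, 0, stream>>>(d_in, d_out);                            // waits for the H2D copy
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream); // waits for the kernel
cudaStreamSynchronize(stream);  // only needed when the host must read h_out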

EDIT 2: I'm not sure we're making any progress here. The issue you pointed out in the GTC presentation (slide 21) is a valid issue, but you can't work around it by inserting additional synchronization operations, nor would a "blocking kernel" help you with that, nor is it a function of having one copy engine or two. If you issue kernels in separate streams, but in sequence with no other intervening CUDA operations, then that hazard exists. The solution, as pointed out on the next slide, is to not issue the kernels sequentially, which is roughly comparable to your second case (a sketch follows the two points below). I'll state this again:

  • you have identified that your case 2 gives good concurrency
  • the sleep operation in that case is not needed for data integrity
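In code, breadth-first issue looks roughly like this (a sketch with placeholder names, paralleling the next slide of the webinar):

// Issue all H2D copies, then all kernels, then all D2H copies, so no
// D2H copy is queued between a kernel and the next stream's H2D copy.
for (int i = 0; i < NSTREAMS; ++i)
    cudaMemcpyAsync(d_in[i], h_in[i], bytes,
                    cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < NSTREAMS; ++i)
    kernel<<<1, 32, 0, stream[i]>>>(d_in[i], d_out[i]);
for (int i = 0; i < NSTREAMS; ++i)
    cudaMemcpyAsync(h_out[i], d_out[i], bytes,
                    cudaMemcpyDeviceToHost, stream[i]);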

If you want to provide a short sample code that demonstrates the issue, perhaps other discoveries can be made.