I have a multi-threaded CPU application and I would like each CPU thread to be able to launch a separate CUDA stream. The separate CPU threads will be doing different things at different times, so there is a chance their work won't overlap, but if they do launch CUDA kernels at the same time I would like those kernels to run concurrently.
I'm pretty sure this is possible because section 3.2.5.5 of the CUDA Toolkit documentation says: "A stream is a sequence of commands (possibly issued by different host threads)..."
So if I want to implement this I would do something like
void main(int CPU_ThreadID) {                      // function run by each CPU thread
    cudaStream_t stream;                           // a stream object, not a pointer
    cudaStreamCreate(&stream);
    int *d_a;
    int *a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));     // device buffer for this thread
    cudaMallocHost((void**)&a, 100*8*sizeof(int)); // pinned host buffer
    cudaMemcpyAsync(d_a, a + 100*CPU_ThreadID, 100*sizeof(int), cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);
    cudaStreamDestroy(stream);
}
That is just a simple example. If I know there are only 8 CPU threads, then I know at most 8 streams will be created. Is this the proper way to do this? Will the work run concurrently if two or more host threads reach this code at around the same time? Thanks for any help!
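For reference, here is a fuller self-contained sketch of what I'm aiming for (the std::thread setup, the cpu_thread_work name, and the sum kernel body are just placeholders for illustration, not my real code):

#include <thread>
#include <vector>
#include <cuda_runtime.h>

#define N 100          // elements per CPU thread (placeholder size)
#define NUM_THREADS 8  // number of CPU threads, as in the question

// Placeholder kernel standing in for the real work.
__global__ void sum(int *d_a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) d_a[idx] += 1;
}

// Work done by each CPU thread: its own stream, its own device buffer,
// and an async copy from its slice of the shared pinned host buffer.
void cpu_thread_work(int CPU_ThreadID, int *a)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *d_a;
    cudaMalloc((void**)&d_a, N * sizeof(int));

    cudaMemcpyAsync(d_a, a + N * CPU_ThreadID, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    sum<<<100, 32, 0, stream>>>(d_a);

    cudaStreamSynchronize(stream);   // wait for this thread's own stream before cleanup
    cudaFree(d_a);
    cudaStreamDestroy(stream);
}

int main()
{
    // One pinned host allocation shared by all threads; each thread uses its own slice.
    int *a;
    cudaMallocHost((void**)&a, N * NUM_THREADS * sizeof(int));

    std::vector<std::thread> threads;
    for (int i = 0; i < NUM_THREADS; ++i)
        threads.emplace_back(cpu_thread_work, i, a);
    for (auto &t : threads)
        t.join();

    cudaFreeHost(a);
    return 0;
}

The cudaStreamSynchronize call just makes each CPU thread wait for its own stream's copy and kernel to finish before freeing the device buffer and destroying the stream.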
Edit:
I corrected some of the syntax issues in the code block and put in cudaMemcpyAsync, as sgar91 suggested.
Comment: ... malloc ... the stream pointer. Also, you may consider using cudaMemcpyAsync if you want the streams to overlap. – sgarizvi