I have a multi-threaded CPU application and I would like each CPU thread to be able to launch a separate CUDA stream. The separate CPU threads will be doing different things at different times, so there is a chance their work won't overlap, but if they do launch CUDA kernels at the same time I would like those kernels to run concurrently.
I'm pretty sure this is possible because section 3.2.5.5 of the CUDA Toolkit documentation says: "A stream is a sequence of commands (possibly issued by different host threads)..."
So if I want to implement this I would do something like
void main(int CPU_ThreadID) {                      // function run by each CPU thread
    cudaStream_t stream;                           // a stream object, not a pointer
    cudaStreamCreate(&stream);
    int *d_a;
    int *a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));     // device buffer for this thread
    cudaMallocHost((void**)&a, 100*8*sizeof(int)); // pinned host buffer
    cudaMemcpyAsync(d_a, a + 100*CPU_ThreadID, 100*sizeof(int), cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);
    cudaStreamDestroy(stream);
}
That is just a simple example. If I know there are only 8 CPU threads, then I know at most 8 streams will be created. Is this the proper way to do this? Will the work run concurrently if two or more host threads reach this code at around the same time? Thanks for any help!
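For reference, here is a fuller self-contained sketch of what I'm aiming for (the std::thread setup, the cpu_thread_work name, and the sum kernel body are just placeholders for illustration, not my real code):

#include <thread>
#include <vector>
#include <cuda_runtime.h>

#define N 100          // elements per CPU thread (placeholder size)
#define NUM_THREADS 8  // number of CPU threads, as in the question

// Placeholder kernel standing in for the real work.
__global__ void sum(int *d_a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) d_a[idx] += 1;
}

// Work done by each CPU thread: its own stream, its own device buffer,
// and an async copy from its slice of the shared pinned host buffer.
void cpu_thread_work(int CPU_ThreadID, int *a)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *d_a;
    cudaMalloc((void**)&d_a, N * sizeof(int));

    cudaMemcpyAsync(d_a, a + N * CPU_ThreadID, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    sum<<<100, 32, 0, stream>>>(d_a);

    cudaStreamSynchronize(stream);   // wait for this thread's own stream before cleanup
    cudaFree(d_a);
    cudaStreamDestroy(stream);
}

int main()
{
    // One pinned host allocation shared by all threads; each thread uses its own slice.
    int *a;
    cudaMallocHost((void**)&a, N * NUM_THREADS * sizeof(int));

    std::vector<std::thread> threads;
    for (int i = 0; i < NUM_THREADS; ++i)
        threads.emplace_back(cpu_thread_work, i, a);
    for (auto &t : threads)
        t.join();

    cudaFreeHost(a);
    return 0;
}

The cudaStreamSynchronize call just makes each CPU thread wait for its own stream's copy and kernel to finish before freeing the device buffer and destroying the stream.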
Edit:
I corrected some of the syntax issues in the code block and put in cudaMemcpyAsync, as sgar91 suggested.
Comment: ... malloc ... the stream pointer. Also, you may consider using cudaMemcpyAsync if you want the streams to overlap. – sgarizvi