4
votes

I'm not finding much info on the mechanics of a kernel launch operation. The API documentation says to see the CUDA Programming Guide, and I'm not finding much there either.
Since kernel execution is asynchronous, and some machines support concurrent kernel execution, I'm led to believe there is a queue for the kernels.

    Host code:
    1. hostArry = malloc(......);
    2. cudaMalloc(&deviceArry, .....);
    3. cudaMemcpy(deviceArry, hostArry, ... cudaMemcpyHostToDevice);
    4. kernelA<<<1,300>>>(int, int);
    5. kernelB<<<10,2>>>(float, int);
    6. cudaMemcpy(hostArry, deviceArry, ... cudaMemcpyDeviceToHost);
    7. cudaFree(deviceArry);

Line 3 is synchronous. Lines 4 and 5 are asynchronous, and the machine supports concurrent execution. So at some point, both of these kernels are running on the GPU. (There is the possibility that kernelB starts and finishes before kernelA finishes.) While this is happening, the host is executing line 6. Line 6 is synchronous with respect to the copy operation, but there is nothing preventing it from executing before kernelA or kernelB has finished.
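
To check my understanding of the asynchrony, here is a small timing sketch I put together (the busy-loop kernel and iteration count are just placeholders I made up): the launch should return almost immediately, while cudaDeviceSynchronize() should block until the kernel is done.

    #include <chrono>
    #include <cstdio>

    __global__ void busyKernel(int iters) {
        volatile int x = 0;               // volatile so the loop isn't optimized away
        for (int i = 0; i < iters; ++i) x += i;
    }

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        busyKernel<<<1, 256>>>(1 << 24);  // asynchronous: returns right away
        auto t1 = std::chrono::steady_clock::now();
        cudaDeviceSynchronize();          // blocks until the kernel finishes
        auto t2 = std::chrono::steady_clock::now();
        std::printf("launch returned after %lld us, kernel done after %lld us\n",
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t0).count());
        return 0;
    }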

1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?

2
Work submitted to the GPU is submitted to streams. Each stream of work will execute in order. The code sample uses the default stream, so all operations will execute in order on the GPU. The CPU may start executing cudaMemcpy before kernelA starts executing on the GPU; however, the memory operation will occur only after kernelB completes. For more information, see the CUDA C/C++ Streams and Concurrency webinar. - Greg Smith
@Doug, you seem to be confusing CPU/GPU concurrency (asynchrony) with GPU/GPU concurrency. In your example, there is no possibility that kernelB could start or finish before kernelA finishes, because they are both launched in the NULL stream and therefore serialized. If they were in separate streams, they might execute concurrently on the GPU; but the kernel invocations are asynchronous (the CPU continues executing while they run). The cudaMemcpy() call is synchronous by definition - it waits until the memcpy is done before returning. See Section 2.5: bit.ly/TbZcq4 - ArchaeaSoftware

2 Answers

4
votes

Yes, there are a variety of queues on the GPU, and the driver manages those.

Asynchronous calls return more or less immediately. Synchronous calls do not return until the operation is complete. Kernel calls are asynchronous. Most other CUDA runtime API calls are designated by the suffix Async if they are asynchronous. So to answer your questions:

1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)

There are various queues. The GPU blocks/stalls the host on a synchronous call, but the kernel launch is not a synchronous operation. It returns immediately, before the kernel has completed, and perhaps before the kernel has even started. When launching operations into a single stream, all CUDA operations in that stream are serialized. Therefore, even though kernel launches are asynchronous, you will not observe overlapped execution for two kernels launched to the same stream, because the CUDA subsystem guarantees that a given CUDA operation in a stream will not start until all previous CUDA operations in the same stream have finished. There are other specific rules for the null stream (the stream you are using if you don't explicitly call out streams in your code), but the preceding description is sufficient for understanding this question.
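
As a sketch of the difference (kernel names and launch configurations here are placeholders, not taken from your code): two kernels launched into the same stream serialize, while kernels launched into separate non-default streams may overlap on hardware that supports concurrent execution.

    __global__ void kernelA() { /* ... work ... */ }
    __global__ void kernelB() { /* ... work ... */ }

    void serialized() {
        // Same (null) stream: kernelB will not start until kernelA finishes.
        kernelA<<<1, 300>>>();
        kernelB<<<10, 2>>>();
    }

    void possiblyConcurrent() {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        // Different streams: no ordering guarantee between the two kernels.
        kernelA<<<1, 300, 0, s1>>>();
        kernelB<<<10, 2, 0, s2>>>();
        cudaStreamSynchronize(s1);   // host waits for each stream's work
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }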

2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?

Since the operation that transfers results from the device to the host is a CUDA call (cudaMemcpy...), and it is issued in the same stream as the preceding operations, the device and CUDA driver manage the execution sequence of CUDA calls so that the cudaMemcpy does not begin until all previous CUDA calls issued to the same stream have completed. Therefore, a cudaMemcpy issued after a kernel call in the same stream is guaranteed not to start until the kernel call is complete, even if you use cudaMemcpyAsync.
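
To make that concrete, here is a minimal sketch (the array names, kernel body, and sizes are my own illustrations): even cudaMemcpyAsync, issued into the same stream as the kernel, is queued behind it. Only the host-side blocking behavior differs, so the host must still synchronize before reading the results.

    __global__ void kernelA(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * i;
    }

    int main() {
        const int n = 300;
        float *hostArry, *deviceArry;
        cudaMallocHost(&hostArry, n * sizeof(float)); // pinned memory, needed for a truly async copy
        cudaMalloc(&deviceArry, n * sizeof(float));

        kernelA<<<1, n>>>(deviceArry, n);             // asynchronous launch into the null stream
        cudaMemcpyAsync(hostArry, deviceArry, n * sizeof(float),
                        cudaMemcpyDeviceToHost, 0);   // queued after kernelA in the same stream
        cudaStreamSynchronize(0);                     // host blocks here; hostArry is now safe to read

        cudaFree(deviceArry);
        cudaFreeHost(hostArry);
        return 0;
    }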

0
votes

You can use cudaDeviceSynchronize() after a kernel call to guarantee that all previous tasks requested of the device have completed. If the results of kernelB are independent of the results of kernelA, you can place this call right before the memory copy operation. If not, you will need to block the device before calling kernelB, resulting in two blocking operations.
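
As a sketch of this placement (the kernel bodies and buffer sizes below are placeholders, and the buffers are assumed to be already allocated): launch both kernels, then block the host once, right before the copy.

    __global__ void kernelA(int *a)   { a[threadIdx.x] = threadIdx.x; }
    __global__ void kernelB(float *b) { b[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f; }

    void runBoth(int *hostArry, int *devA, float *devB) {
        kernelA<<<1, 300>>>(devA);   // asynchronous
        kernelB<<<10, 2>>>(devB);    // asynchronous
        cudaDeviceSynchronize();     // single blocking point: ALL prior device work is done
        cudaMemcpy(hostArry, devA, 300 * sizeof(int), cudaMemcpyDeviceToHost);
    }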