
The CUDA documentation describes using 2 streams (stream0 and stream1) in the following way: copy data to the device in stream0, launch a kernel in stream0, copy the results back from the device in stream0, and then perform the same three operations in stream1. According to the book "CUDA by Example" (2010), this pattern does not give concurrent execution, yet the "concurrent kernels" sample uses this same method and does achieve concurrent execution. Can you help me understand the difference between the two examples?


2 Answers


Overlapped data transfer depends on many factors, including the compute capability version and the coding style. This blog post may provide more information:

https://developer.nvidia.com/content/how-overlap-data-transfers-cuda-cc


I'm just expanding Eric's answer.

In the CUDA C Programming Guide, an example is reported that uses 2 streams, say stream0 and stream1, to do the following:

CASE A

memcpyHostToDevice --- stream0
kernel execution   --- stream0
memcpyDeviceToHost --- stream0

memcpyHostToDevice --- stream1
kernel execution   --- stream1
memcpyDeviceToHost --- stream1

In other words, all the operations of stream0 are issued first, and then those of stream1. The same example is reported in the "CUDA By Example" book, Section 10.5, but there it is "apparently" concluded (in "apparent" contradiction with the guide) that concurrency is not achieved this way.
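As a sketch, the CASE A (depth-first) issue order could look as follows. The buffer names, the size N and the kernel myKernel are illustrative assumptions, not taken from the guide; note that the host buffers must be pinned (allocated with cudaHostAlloc) for cudaMemcpyAsync to be truly asynchronous.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel; any per-element work would do.
__global__ void myKernel(float *d, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) d[i] = 2.0f * d[i];
}

int main() {
    const int N = 1 << 20;
    float *h0, *h1, *d0, *d1;
    // Pinned host memory is required for copies to overlap with kernels.
    cudaHostAlloc(&h0, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&h1, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d0, N * sizeof(float));
    cudaMalloc(&d1, N * sizeof(float));

    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    // CASE A: all of stream0's operations are issued first ...
    cudaMemcpyAsync(d0, h0, N * sizeof(float), cudaMemcpyHostToDevice, stream0);
    myKernel<<<(N + 255) / 256, 256, 0, stream0>>>(d0, N);
    cudaMemcpyAsync(h0, d0, N * sizeof(float), cudaMemcpyDeviceToHost, stream0);

    // ... and only then stream1's.
    cudaMemcpyAsync(d1, h1, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    myKernel<<<(N + 255) / 256, 256, 0, stream1>>>(d1, N);
    cudaMemcpyAsync(h1, d1, N * sizeof(float), cudaMemcpyDeviceToHost, stream1);

    cudaDeviceSynchronize();

    cudaFreeHost(h0); cudaFreeHost(h1);
    cudaFree(d0); cudaFree(d1);
    cudaStreamDestroy(stream0); cudaStreamDestroy(stream1);
    return 0;
}
```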

In Section 10.6 of "CUDA By Example", the following alternative use of streams is proposed

CASE B

memcpyHostToDevice --- stream0
memcpyHostToDevice --- stream1
kernel execution   --- stream0
kernel execution   --- stream1
memcpyDeviceToHost --- stream0
memcpyDeviceToHost --- stream1

In other words, the memory copy operations and kernel executions of stream0 and stream1 are now interleaved. The book points out that concurrency is achieved with this solution.
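Under the same illustrative assumptions as before (pinned host buffers h0/h1, device buffers d0/d1 of N floats, a hypothetical kernel myKernel, and streams stream0/stream1 already created), the CASE B (breadth-first) issue order only changes the order of the API calls:

```cuda
// CASE B: interleave the operations of the two streams.
cudaMemcpyAsync(d0, h0, N * sizeof(float), cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(d1, h1, N * sizeof(float), cudaMemcpyHostToDevice, stream1);

myKernel<<<(N + 255) / 256, 256, 0, stream0>>>(d0, N);
myKernel<<<(N + 255) / 256, 256, 0, stream1>>>(d1, N);

cudaMemcpyAsync(h0, d0, N * sizeof(float), cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(h1, d1, N * sizeof(float), cudaMemcpyDeviceToHost, stream1);

cudaDeviceSynchronize();
```

Within each stream the usual copy-kernel-copy dependencies still hold; only the issue order seen by the hardware queues changes.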

Actually, there is no contradiction between the "CUDA By Example" book and the CUDA C Programming Guide. The discussion in the book was carried out with particular reference to a GTX 285 card while, as already pointed out by Eric and in the quoted blog post How to Overlap Data Transfers in CUDA C/C++, concurrency is achieved differently on different architectures, as a result of the dependencies and of the number of copy engines available.

For example, the blog considers two cards: the C1060 and the C2050. The former has one kernel engine and one copy engine, which can issue only one memory transaction (H2D or D2H) at a time. The latter has one kernel engine and two copy engines, which can issue two memory transactions (one H2D and one D2H) simultaneously. For the C1060, having only one copy engine, the following happens:

CASE A - C1060 - NO CONCURRENCY ACHIEVED

Stream       Kernel engine         Copy engine             Comment

stream0 ----                       memcpyHostToDevice ----
stream0 ---- kernel execution ----                         Depends on previous memcpy
stream0 ----                       memcpyDeviceToHost ---- Depends on previous kernel
stream1 ----                       memcpyHostToDevice ---- 
stream1 ---- kernel execution ----                         Depends on previous memcpy
stream1 ----                       memcpyDeviceToHost ---- Depends on previous kernel

CASE B - C1060 - CONCURRENCY ACHIEVED

Stream         Kernel engine           Copy engine               Comment

stream0   ----                         memcpyHostToDevice 0 ----
stream0/1 ---- Kernel execution 0 ---- memcpyHostToDevice 1 ----  
stream0/1 ---- Kernel execution 1 ---- memcpyDeviceToHost 0 ---- 
stream1   ----                         memcpyDeviceToHost 1 ---- 

Concerning the C2050, and considering the case of 3 streams, concurrency is now achieved even in CASE A, unlike on the C1060.

CASE A - C2050 - CONCURRENCY ACHIEVED

Stream           Kernel engine           Copy engine H2D           Copy engine D2H

stream0     ----                         memcpyHostToDevice 0 ----
stream0/1   ---- kernel execution 0 ---- memcpyHostToDevice 1 ----                              
stream0/1/2 ---- kernel execution 1 ---- memcpyHostToDevice 2 ---- memcpyDeviceToHost 0
stream0/1/2 ---- kernel execution 2 ----                           memcpyDeviceToHost 1
stream2     ----                                                   memcpyDeviceToHost 2
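Whether a given card has one or two copy engines can be queried at runtime. Below is a minimal sketch using the standard cudaGetDeviceProperties call: the asyncEngineCount field is 1 on cards like the C1060 and 2 on cards like the C2050, and concurrentKernels reports whether kernels from different streams can run concurrently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Device: %s\n", prop.name);
    // 0: no copy/kernel overlap; 1: one copy engine; 2: concurrent H2D and D2H.
    printf("asyncEngineCount: %d\n", prop.asyncEngineCount);
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    return 0;
}
```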