The CUDA documentation describes using 2 streams (stream0 and stream1) this way: copy data to the device in stream0, launch a kernel in stream0, copy the results back from the device in stream0, and then perform the same three operations in stream1. According to the book "CUDA by Example" (2010), issuing work this way does not achieve concurrent execution, yet the "concurrent kernels" sample uses this very method and does achieve concurrency. So can you please help me understand the difference between the two examples?
2 Answers
Whether data transfers overlap depends on many factors, including the compute capability version and the coding style. This blog post may provide more info:
https://developer.nvidia.com/content/how-overlap-data-transfers-cuda-cc
I'm just expanding Eric's answer.
In the CUDA C Programming Guide, an example is given of using 2 streams, say stream0 and stream1, to do the following:
CASE A
memcpyHostToDevice --- stream0
kernel execution --- stream0
memcpyDeviceToHost --- stream0
memcpyHostToDevice --- stream1
kernel execution --- stream1
memcpyDeviceToHost --- stream1
In other words, all the operations of stream0 are issued first, followed by all those of stream1. The same example is reported in the "CUDA By Example" book, Section 10.5, but there it is concluded (in apparent contradiction with the guide) that concurrency is not achieved this way.
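For concreteness, here is a minimal sketch of the CASE A issue order. The kernel myKernel, the buffer names, and the sizes are illustrative placeholders, not code from the guide or the book; note that the host buffers must be pinned (e.g. allocated with cudaMallocHost) for cudaMemcpyAsync to be truly asynchronous.

#include <cuda_runtime.h>

// Placeholder kernel: doubles every element.
__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i];
}

int main()
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Pinned host buffers and device buffers, one pair per stream.
    float *h0, *h1, *d0, *d1;
    cudaMallocHost(&h0, bytes);
    cudaMallocHost(&h1, bytes);
    cudaMalloc(&d0, bytes);
    cudaMalloc(&d1, bytes);

    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    // CASE A: issue all of stream0's operations, then all of stream1's.
    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, stream0);
    myKernel<<<(N + 255) / 256, 256, 0, stream0>>>(d0, N);
    cudaMemcpyAsync(h0, d0, bytes, cudaMemcpyDeviceToHost, stream0);

    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, stream1);
    myKernel<<<(N + 255) / 256, 256, 0, stream1>>>(d1, N);
    cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, stream1);

    cudaDeviceSynchronize();

    cudaStreamDestroy(stream0); cudaStreamDestroy(stream1);
    cudaFreeHost(h0); cudaFreeHost(h1); cudaFree(d0); cudaFree(d1);
    return 0;
}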
In Section 10.6 of "CUDA By Example", the following alternative use of streams is proposed
CASE B
memcpyHostToDevice --- stream0
memcpyHostToDevice --- stream1
kernel execution --- stream0
kernel execution --- stream1
memcpyDeviceToHost --- stream0
memcpyDeviceToHost --- stream1
In other words, the memory copy operations and kernel executions of stream0 and stream1 are now interleaved. The book points out that with this solution concurrency is achieved.
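Using the same (illustrative) setup as in the sketch above, the CASE B issue order would look like this:

// CASE B: interleave the operations of stream0 and stream1.
cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, stream1);

myKernel<<<(N + 255) / 256, 256, 0, stream0>>>(d0, N);
myKernel<<<(N + 255) / 256, 256, 0, stream1>>>(d1, N);

cudaMemcpyAsync(h0, d0, bytes, cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, stream1);

cudaDeviceSynchronize();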
Actually, there is no contradiction between the "CUDA By Example" book and the CUDA C Programming Guide: the discussion in the book refers specifically to a GTX 285 card, while, as already pointed out by Eric and in the quoted blog post How to Overlap Data Transfers in CUDA C/C++, concurrency is achieved differently on different architectures, depending on the dependencies between operations and on the number of copy engines available.
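Incidentally, you can check how many copy engines your own card has by querying cudaDeviceProp::asyncEngineCount; the device index 0 below is just an example:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 1 copy engine: one transfer (H2D *or* D2H) can overlap kernels (e.g. C1060).
    // 2 copy engines: H2D and D2H transfers can also overlap each other (e.g. C2050).
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return 0;
}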
For example, the blog considers two cards: the C1060 and the C2050. The former has one kernel engine and one copy engine, which can issue only one memory transfer (H2D or D2H) at a time. The latter has one kernel engine and two copy engines, which can simultaneously issue two memory transfers (one H2D and one D2H) at a time. What happens on the C1060, with its single copy engine, is the following:
CASE A - C1060 - NO CONCURRENCY ACHIEVED
Stream     Kernel engine        Copy engine            Comment
stream0    ----                 memcpyHostToDevice     ----
stream0    kernel execution     ----                   Depends on previous memcpy
stream0    ----                 memcpyDeviceToHost     Depends on previous kernel
stream1    ----                 memcpyHostToDevice     ----
stream1    kernel execution     ----                   Depends on previous memcpy
stream1    ----                 memcpyDeviceToHost     Depends on previous kernel
CASE B - C1060 - CONCURRENCY ACHIEVED
Stream       Kernel engine         Copy engine             Comment
stream0      ----                  memcpyHostToDevice 0    ----
stream0/1    kernel execution 0    memcpyHostToDevice 1    ----
stream0/1    kernel execution 1    memcpyDeviceToHost 0    ----
stream1      ----                  memcpyDeviceToHost 1    ----
Concerning the C2050, and considering the case of 3 streams, concurrency is now achieved in CASE A, unlike on the C1060 (a sketch of the corresponding issue order follows the table).
CASE A - C2050 - CONCURRENCY ACHIEVED
Stream        Kernel engine         Copy engine H2D         Copy engine D2H
stream0       ----                  memcpyHostToDevice 0    ----
stream0/1     kernel execution 0    memcpyHostToDevice 1    ----
stream0/1/2   kernel execution 1    memcpyHostToDevice 2    memcpyDeviceToHost 0
stream0/1/2   kernel execution 2    ----                    memcpyDeviceToHost 1
stream2       ----                  ----                    memcpyDeviceToHost 2
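For completeness, here is a sketch of the 3-stream CASE A issue order shown in the table, reusing myKernel, N and bytes from the first sketch (again, illustrative only, not the blog's code):

// CASE A generalized to 3 streams: depth-first issue order.
const int NS = 3;
float *h[NS], *d[NS];
cudaStream_t stream[NS];
for (int s = 0; s < NS; ++s) {
    cudaMallocHost(&h[s], bytes);   // pinned host buffer
    cudaMalloc(&d[s], bytes);
    cudaStreamCreate(&stream[s]);
}
for (int s = 0; s < NS; ++s) {
    cudaMemcpyAsync(d[s], h[s], bytes, cudaMemcpyHostToDevice, stream[s]);
    myKernel<<<(N + 255) / 256, 256, 0, stream[s]>>>(d[s], N);
    cudaMemcpyAsync(h[s], d[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();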