
I have tried to overlap kernel executions with cudaMemcpyAsync device-to-host (D2H) transfers, but it doesn't work. I have N groups of elements. Each group has 64 elements, which can be processed in parallel, but the groups themselves must be processed sequentially due to a data dependency: the elements of group i must be processed after the elements of group i-1. Processing each element of a group produces one output, which must be transferred from GPU to CPU. To overlap this D2H transfer, I partitioned the elements of a group into multiple chunks, so that the kernel execution and the D2H cudaMemcpyAsync of different chunks can be overlapped using streams. I used the following pseudocode to process N groups of elements with K streams.

    groupId = 0;
    while (groupId < N) {
        for (i = 0; i < K; ++i) {
            // all streams must wait for the last stream's kernel of the
            // previous group to finish before the next group is processed
            if (groupId != 0)
                cudaStreamWaitEvent(stream[i], syncEvent, 0);

            kernelA<<<grid, block, 0, stream[i]>>>(/* chunk i of group groupId */);

            // record the event only after the last stream's kernel of this group
            if (i == K - 1)
                cudaEventRecord(syncEvent, stream[K - 1]);

            cudaMemcpyAsync(/* dst */, /* src */, /* bytes */,
                            cudaMemcpyDeviceToHost, stream[i]);
        }
        groupId++;
    }

When I use two streams there is some overlap, but when I increase the number of streams there is no overlap, as shown in the figures below.

[Figure 1: Processing of 64 elements using two streams]

[Figure 2: Processing of 64 elements using four streams]

Kindly explain why the D2H data transfer is not fully overlapped. Moreover, in the case of four streams, the kernel in each stream is launched with 16 thread blocks of 128 threads each. Conceptually, two streams should execute concurrently (each on its own SM), since enough resources are available on the GPU. However, there is no concurrency between the kernel executions of the different streams (Figure 2). What are the reasons for the lack of concurrency in this case?


1 Answer


Your transfers of 64 bytes are far too small to overlap with anything: at full PCIe 2.0 speed (approx. 6 GB/s) the actual transfer takes about 10 nanoseconds. That would be about 1/1000th of a pixel width on the profiler timeline at the scale of your screenshots. The finite width of the bars and gaps is entirely due to the per-transfer overhead (setup, etc.).
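For reference, a quick back-of-the-envelope check of that figure (using the approximate 6 GB/s PCIe 2.0 bandwidth quoted above; the fixed per-call launch cost, typically on the order of microseconds, is not modelled here):

    /* Wire time for one 64-byte chunk at ~6 GB/s (values from the answer above). */
    #include <stdio.h>

    int main(void)
    {
        const double bytes     = 64.0;   /* results of one chunk */
        const double bandwidth = 6e9;    /* ~6 GB/s effective PCIe 2.0 */

        double wire_time_ns = bytes / bandwidth * 1e9;
        printf("wire time: %.1f ns\n", wire_time_ns);   /* prints ~10.7 ns */
        return 0;
    }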

You want to transfer on the order of megabytes per copy to be able to overlap transfers with compute.
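A minimal sketch of how such batching might look, assuming pinned host memory and illustrative buffer sizes; kernelA's body, the ~4 MB chunk size, and all variable names here are placeholders, not taken from the question:

    // Sketch: batch per-element results into one large pinned buffer per chunk and
    // issue a single big D2H copy per stream. Input initialization / H2D copies omitted.
    #include <cuda_runtime.h>

    __global__ void kernelA(const float *in, float *out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = in[idx] * 2.0f;   // placeholder computation
    }

    int main()
    {
        const int K = 4;                         // number of streams
        const int chunkElems = 1 << 20;          // ~4 MB of float results per chunk
        const size_t chunkBytes = chunkElems * sizeof(float);

        cudaStream_t stream[K];
        float *dIn, *dOut, *hOut;
        cudaMalloc((void **)&dIn,  K * chunkBytes);
        cudaMalloc((void **)&dOut, K * chunkBytes);
        cudaHostAlloc((void **)&hOut, K * chunkBytes, cudaHostAllocDefault); // pinned
        for (int i = 0; i < K; ++i)
            cudaStreamCreate(&stream[i]);

        dim3 block(128);
        dim3 grid((chunkElems + block.x - 1) / block.x);
        for (int i = 0; i < K; ++i) {
            kernelA<<<grid, block, 0, stream[i]>>>(dIn + i * chunkElems,
                                                   dOut + i * chunkElems,
                                                   chunkElems);
            // One large copy per chunk: long enough on the copy engine to
            // overlap with kernels running in the other streams.
            cudaMemcpyAsync(hOut + i * chunkElems, dOut + i * chunkElems,
                            chunkBytes, cudaMemcpyDeviceToHost, stream[i]);
        }
        cudaDeviceSynchronize();

        for (int i = 0; i < K; ++i)
            cudaStreamDestroy(stream[i]);
        cudaFree(dIn);
        cudaFree(dOut);
        cudaFreeHost(hOut);
        return 0;
    }

With chunks of that size the transfer itself takes hundreds of microseconds at PCIe 2.0 rates, so it dwarfs the per-call overhead and the copies should show up on the profiler timeline overlapping the kernels of the other streams.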