I have tried to overlap kernel execution with cudaMemcpyAsync D2H transfers, but it doesn't work. I have N groups of elements. Each group has 64 elements that can be processed in parallel, but the groups themselves must be processed sequentially due to a data dependency: the elements of group i can only be processed after the elements of group i-1 are finished. Processing each element of a group produces one output that must be transferred from GPU to CPU. To overlap this D2H transfer with computation, I partitioned the elements of a group into multiple chunks so that the kernel execution and the D2H cudaMemcpyAsync of different chunks can overlap using streams. I used the following pseudocode to process the N groups with K streams.
groupId = 0;
while (groupId < N) {
    for (i = 0; i < K; ++i) {
        // all streams must wait for the last stream's kernel of the
        // previous group before starting to process the next group
        if (groupId != 0)
            cudaStreamWaitEvent(stream[i], syncEvent, 0);
        kernelA<<<grid, block, 0, stream[i]>>>(----, ----, ----);
        if (i == K - 1)
            cudaEventRecord(syncEvent, stream[K - 1]);
        cudaMemcpyAsync(----, ----, ----, cudaMemcpyDeviceToHost, stream[i]);
    }
    groupId++;
}
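For completeness, here is a minimal compilable sketch of my setup (buffer names, sizes, and the kernel body are illustrative placeholders, not my actual code). The host output buffer is allocated with cudaMallocHost, since cudaMemcpyAsync only runs asynchronously when the host memory is pinned:

```cuda
#include <cuda_runtime.h>

#define N_GROUPS 10          // number of groups (illustrative)
#define K        4           // number of streams
#define GROUP    64          // elements per group
#define CHUNK    (GROUP / K) // elements per stream per group

// Placeholder kernel: one thread block per element of the chunk.
// In my real code each block has 128 threads cooperating on one element;
// here only thread 0 does a stand-in computation.
__global__ void kernelA(float *out, const float *in)
{
    if (threadIdx.x == 0)
        out[blockIdx.x] = in[blockIdx.x] * 2.0f;
}

int main(void)
{
    cudaStream_t stream[K];
    cudaEvent_t  syncEvent;
    float *dIn, *dOut, *hOut;

    for (int i = 0; i < K; ++i) cudaStreamCreate(&stream[i]);
    cudaEventCreate(&syncEvent);
    cudaMalloc(&dIn,  GROUP * sizeof(float));
    cudaMalloc(&dOut, GROUP * sizeof(float));
    cudaMallocHost(&hOut, GROUP * sizeof(float)); // pinned: needed for async D2H

    for (int groupId = 0; groupId < N_GROUPS; ++groupId) {
        for (int i = 0; i < K; ++i) {
            // serialize groups: every stream waits for the previous
            // group's last kernel before starting the next group
            if (groupId != 0)
                cudaStreamWaitEvent(stream[i], syncEvent, 0);
            kernelA<<<CHUNK, 128, 0, stream[i]>>>(dOut + i * CHUNK,
                                                  dIn  + i * CHUNK);
            if (i == K - 1)
                cudaEventRecord(syncEvent, stream[K - 1]);
            cudaMemcpyAsync(hOut + i * CHUNK, dOut + i * CHUNK,
                            CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[i]);
        }
    }
    cudaDeviceSynchronize();
    // cleanup of streams, event, and buffers omitted for brevity
    return 0;
}
```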
When I use two streams there is some overlap, but when I increase the number of streams there is no overlap, as shown in the figures below.

(Figure 1: Processing of 64 elements using two streams)
(Figure 2: Processing of 64 elements using four streams)
Kindly explain why the D2H data transfer is not fully overlapped. Moreover, in the four-stream case, each stream's kernel is launched with 16 thread blocks of 128 threads each. Conceptually, two streams should be able to execute concurrently (each on its own SM), since enough resources are available on the GPU. However, there is no concurrency among the kernel executions of the different streams (fig. 2). What are the reasons for the lack of concurrency in this case?