I seem to be hitting a limit on the number of asynchronous kernel launches that can be queued up in the compute engine queue. Once this limit is reached, the host blocks on the next launch and GPU-CPU concurrency is lost. This limit is not mentioned in the CUDA Programming Guide.
- What is the maximum number of asynchronous kernel launches that can be queued up in the compute engine queue?
- Does this maximum number depend in some way on the kernel being launched?
- Does the time it takes for the CPU to put a kernel launch in the compute engine queue depend on the kernel being launched?
- What is the maximum number of asynchronous memcpys that can be queued up in the copy engine queue?
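
For reference, here is a minimal sketch of how the limit might be observed: it times each launch on the host and reports the first launch that takes unusually long to return. The kernel `spin`, the constants `kLaunches` and `kCycles`, and the threshold `kBlockedUs` are all hypothetical choices for illustration, and the number it prints will likely vary with GPU, driver version, and kernel arguments.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for roughly `cycles` GPU clock ticks so that
// launches pile up in the queue faster than the GPU can drain them.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int kLaunches = 4096;         // assumed to exceed the queue depth
    const long long kCycles = 5000000;  // a few milliseconds per kernel on typical clocks
    const double kBlockedUs = 1000.0;   // assumed threshold for "the host was blocked"

    // Warm up so context creation and module loading are not timed.
    spin<<<1, 1>>>(1);
    cudaDeviceSynchronize();

    int firstBlocked = -1;
    for (int i = 0; i < kLaunches; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        spin<<<1, 1>>>(kCycles);  // asynchronous: should normally return at once
        auto t1 = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > kBlockedUs) {
            // The launch call did not return promptly, so the queue was probably full.
            firstBlocked = i;
            std::printf("launch %d took %.1f us on the host\n", i, us);
            break;
        }
    }

    if (firstBlocked < 0) {
        std::printf("no blocking launch seen in %d launches\n", kLaunches);
    }

    // Drain the queued kernels and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Compiled with `nvcc`, this prints the index of the first slow launch, which is only a rough estimate of the queue depth. Replacing `spin` with a kernel that takes more or larger arguments would be one way to probe whether the limit depends on the kernel being launched.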