4
votes

I seem to be hitting a limit on the number of asynchronous kernel launches that can be queued up in the compute engine queue. Once this limit is reached, the host blocks and GPU-CPU concurrency is lost (a sketch of the pattern is below the questions). This is not mentioned in the CUDA programming guide.

  • What is the maximum number of asynchronous kernel launches that can be queued up in the compute engine queue?
  • Does this maximum number depend in some way on the kernel being launched?
  • Does the time it takes for the CPU to put a kernel launch in the compute engine queue depend on the kernel being launched?
  • What is the maximum number of asynchronous memcpys that can be queued up in the copy engine queue?
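
For reference, this is roughly the pattern where I see the behavior; the busy-wait kernel is just a stand-in that keeps the GPU occupied long enough for launches to accumulate in the queue:

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Placeholder kernel that busy-waits so each launch occupies the GPU
// long enough for subsequent launches to pile up in the queue.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    for (int i = 0; i < 64; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        spin<<<1, 1>>>(10000000LL);   // asynchronous w.r.t. the host
        auto t1 = std::chrono::high_resolution_clock::now();
        printf("launch %2d: %8.2f us\n", i,
               std::chrono::duration<double, std::micro>(t1 - t0).count());
        // Past some launch count the time per launch jumps sharply:
        // the host now blocks until the queue has room.
    }
    cudaDeviceSynchronize();
    return 0;
}
```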

1 Answer

3
votes

I am not sure there is a universal answer to this question; to a degree it is platform- and CUDA-version-specific, AFAIK. To answer your bullet points:

  • The limit is the queue size, I believe, so it is a maximum number of queued operations rather than of kernel launches specifically. The same total limit should apply to any combination of kernels, copy operations, and stream events. What that total number of operations is depends on the platform and CUDA version.
  • No
  • No, but once the driver queue is full, the time taken to submit any asynchronous operation increases considerably.
  • See the first point; I don't believe the driver distinguishes between copies, kernel launches, and events.

I can recall doing some benchmarking circa CUDA 2.1 and finding that everything ran quickly until 24 operations had been queued; after that, queuing each subsequent operation slowed down considerably. By the time CUDA 3.0 was released, I no longer had any code that could hit the limit that existed in older versions, so something changed. It should be trivial to write a benchmark to check what more modern CUDA versions do.
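
Something along the following lines would do. The spin kernel, launch count, and 10x threshold are all arbitrary choices of mine, not values from the driver or any documentation; the idea is just to time each asynchronous launch and report where queuing stops being cheap:

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Busy-wait kernel so queued launches take long enough to pile up.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int N = 2048;
    static double times[N];

    spin<<<1, 1>>>(1000LL);          // warm-up to absorb one-time driver costs
    cudaDeviceSynchronize();

    for (int i = 0; i < N; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        spin<<<1, 1>>>(1000000LL);   // roughly 1 ms on a ~1 GHz GPU clock
        auto t1 = std::chrono::high_resolution_clock::now();
        times[i] = std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
    cudaDeviceSynchronize();

    // Baseline from the first few launches, taken while the queue is empty.
    double baseline = 0.0;
    for (int i = 0; i < 8; ++i) baseline += times[i];
    baseline /= 8.0;

    // Heuristic: the queue is "full" at the first launch that takes far
    // longer than the baseline. The 10x factor is an arbitrary choice.
    for (int i = 8; i < N; ++i) {
        if (times[i] > 10.0 * baseline) {
            printf("launch time jumped at operation %d (%.1f us vs %.1f us baseline)\n",
                   i, times[i], baseline);
            return 0;
        }
    }
    printf("no jump observed within %d queued launches\n", N);
    return 0;
}
```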