Concurrent GPU kernel execution from multiple processes

Question

I have an application in which I would like to share a single GPU between multiple processes. That is, each of these processes would create its own CUDA or OpenCL context, targeting the same GPU. According to the Fermi white paper[1], application-level context switching takes less than 25 microseconds, but kernel launches from different contexts are effectively serialized on the GPU, so Fermi wouldn't work well for this. According to the Kepler white paper[2], there is something called Hyper-Q that allows for up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process.

My questions: Has anyone tried this on a Kepler GPU and verified that its kernels are run concurrently when scheduled from distinct processes? Is this just a CUDA feature, or can it also be used with OpenCL on Nvidia GPUs? Do AMD's GPUs support something similar?

[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

[2] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

In answer to the first question you pose, nvidia has published some hyper-Q results in a blog here. — Robert Crovella
Interesting, thanks for the link. That blog post also implies that the K10 GPUs don't have Hyper-Q, while the K20 will. — Brendan Wood
That's correct. You'll note the Kepler white paper link you posted references "GK110" in the title. The GPU on K20 is GK110. The GPU on K10 is GK104 (two of them). — Robert Crovella

Robert Crovella · Accepted Answer · 2012-10-05T13:51:40

In response to the first question, NVIDIA has published some Hyper-Q results in a blog here. The blog points out that the developers porting CP2K reached accelerated results more quickly because Hyper-Q let them use the application's MPI structure more or less as-is: they ran multiple ranks on a single GPU and got higher effective GPU utilization that way. As mentioned in the comments, Hyper-Q is currently only available on K20 processors, as it depends on the GK110 GPU.
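
One rough way to check the first question yourself is to time a GPU program run alone and then as several simultaneous processes. The sketch below is a generic process-timing harness in Python, not something taken from the blog post. The names `run_concurrently` and `overlap_factor` are illustrative, and the command to run (for example, a small CUDA program that launches one long kernel) is assumed to be supplied by you.

```python
import subprocess
import time


def run_concurrently(cmd, n):
    """Start n copies of cmd at once; return wall-clock seconds until all exit."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(n)]
    # Wait for every process before checking exit codes, so none is left running.
    codes = [p.wait() for p in procs]
    elapsed = time.perf_counter() - start
    if any(code != 0 for code in codes):
        raise RuntimeError(f"{cmd!r} failed with exit codes {codes}")
    return elapsed


def overlap_factor(cmd, n):
    """Return n * solo_time / concurrent_time.

    A value near 1 is consistent with the n runs being serialized;
    a value near n is consistent with them overlapping.
    """
    solo = run_concurrently(cmd, 1)
    together = run_concurrently(cmd, n)
    return n * solo / together
```

For example, `overlap_factor(["./spin_kernel"], 4)` would compare one run against four simultaneous runs of a hypothetical `./spin_kernel` binary. The comparison only says something about concurrency if a single instance leaves most of the GPU idle (a small grid, for instance). The measured times also include process startup and context creation, so the kernel should run long enough to dominate them.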
