
Some time back, NVIDIA introduced the concept of vGPUs using its GRID GPUs, where a physical GPU is virtualized into multiple vGPUs, each of which is assigned to a guest VM. While the GRID documentation is fairly clear on memory segregation, what is not clear is how kernels originating from a guest VM execute on the underlying hardware.

The GRID datasheet, in one of its tables (Table 1), mentions "CUDA Cores (time-sliced shared)". Does this imply that a CUDA kernel originating from one guest VM captures the entire GPU for a time slice, followed by kernels from other VMs?

Reference to GRID Datasheet: http://images.nvidia.com/content/pdf/grid/whitepaper/NVIDIA-GRID-WHITEPAPER-vGPU-Delivering-Scalable-Graphics-Rich-Virtual-Desktops.pdf


1 Answer


Currently, CUDA operations originating from a VM that is using a GRID vGPU are not possible, with one exception.

If the GRID (2.0) profile in use is one that maps an entire physical GPU to a single VM, then CUDA operations are possible. In that situation, general CUDA behavior should be similar to bare-metal operation.
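
As a quick sanity check, you could compile a minimal probe like the sketch below with nvcc inside the guest VM to confirm whether a CUDA device is visible and a kernel launch round-trips. This is my illustration, not from NVIDIA's documentation; the file name and messages are made up, and it only uses standard CUDA runtime API calls.

```cuda
// vgpu_cuda_probe.cu -- hypothetical sketch: check from inside a guest VM
// whether CUDA is usable, e.g. under a GRID 2.0 profile that maps an
// entire physical GPU to the VM. Build with: nvcc vgpu_cuda_probe.cu
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: if this launches and the result comes back correct,
// basic CUDA compute is functional in the guest.
__global__ void increment(int *x) { *x += 1; }

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // Expected outcome on vGPU profiles that do not expose CUDA.
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (count == 0) {
        printf("No CUDA devices visible in this VM\n");
        return 1;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s, %.1f GB\n", prop.name,
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));

    // Round-trip a value through a kernel launch.
    int *d_x, h_x = 41;
    cudaMalloc(&d_x, sizeof(int));
    cudaMemcpy(d_x, &h_x, sizeof(int), cudaMemcpyHostToDevice);
    increment<<<1, 1>>>(d_x);
    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    printf("Kernel round-trip result: %d (expected 42)\n", h_x);
    return 0;
}
```

On a full-GPU (8GB) profile this should report the device and print 42; on other vGPU profiles the device enumeration itself is expected to fail.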

Currently, such operation does involve time-slicing ("context switching") between graphics operations and CUDA operations, the same behavior witnessed in a bare-metal scenario.

This is subject to change in the future.

Note that "physical GPU" here refers to a complete logical GPU device. A Tesla M60, for example, has two such "physical GPUs" on board (each with 8GB of memory), and so could support two such VMs when the selected GRID profile dictates that an entire physical GPU is mapped to a single VM.

A reference to this behavior can be found here:

However, it should be noted that there are some limitations here, with NVIDIA noting that CUDA vGPU support requires using the GRID 2.0 "8GB profile."