Techniques to Reduce CPU to GPU Data Transfer Latency

Question

I've been looking into ways to reduce the latency caused by transferring data back and forth from the CPU and GPU. When I first started using CUDA I did notice that data transfer between the CPU and GPU did take a few seconds, but I didn't really care because this isn't really a concern for the small programs I'm been writing. In fact, the latency probably isn't much of a problem for vast majority of the programs that utilize GPUs, video games included, because they're still a lot faster than if they would have run on the CPU.

However, I'm a bit of an HPC enthusiast and I became concerned with the direction of my studies when I saw the massive discrepancy between the Tianhe-I theoretical peak FLOPS and the actual LINPACK measured performance. This has raised my concerns about whether I'm taking the right career path.

Use of pinned memory (page-locked) memory through the use of the cudaHostAlloc() function is one method of reducing latency (quite effective), but are there any other techniques I'm not aware of? And to be clear, I'm talking about optimizing the code, not the hardware itself (that's NVIDIA and AMD's jobs).

Just as a side question, I'm aware that Dell and HP sell Tesla servers. I'm curious as to how well a GPU leverages a database application, where you would need a constant read from the hard drive (HDD or SSD), an operation only the CPU can perform,

You might find this amusing: cs.virginia.edu/~skadron/Papers/bakkum_sqlite_gpgpu10.pdf — Mikhail
@Misha You're right, this article describes the particular the challenge I had in mind. Thanks... — sj755
Regarding the low flop-efficiency of Tianhe-IA: although GPUs have massive arithmetic power (which ideed is often difficult to fully utilize), don't forget that the GPU memory architecture also plays and important role and even compute-bound kernels can seriously outperform CPU implementations. — pszilard

pszilard pszilard · Accepted Answer · 2011-06-29T01:20:55

There are a few ways to address CPU-GPU communication overhead - I hope that's what you mean by latency and not the latency of the transfer itself. Note that I deliberately used the term address instead of reduce as you do not necessarily need to reduce the latency if you can hide it. Also note that I am much more familiar with CUDA, so below I only refer to CUDA, but some features are also available in OpenCL.

As you mentioned page-locked memory has the very purpose of increasing. Additionally, one can map page-locked host memory to the GPU, mechanism which enables direct access of the data allocated from the GPU kernel without the need for additional data-transfer. This mechanism is called zero-copy transfer and it is useful if data is read/written only once accompanied by a substantial amount of computation and for GPUs with no separate memory (mobile). However, if the kernel accessing the zero-copied data is not strongly compute-bound and therefore the latency of data access cannot be hidden, page-locked but not mapped memory will be more efficient. Additionally, if the data does not fit into the GPU memory, zero-copy will still work.
Note that excessive amount of page-locked memory can cause serious slowdown on the CPU side.

Approaching the problem from a different angle, as tkerwin mentioned, asynchronous transfer (wrt the CPU thread talking to the GPU) is the key to hide CPU-GPU transfer latency by overlapping computation on the CPU with the transfer. This can be achieved with cudaMemcpyAsync() as well as using zero-copy with asynchronous kernel execution.
One can take this even further by using multiple streams to overlap transfer with kernel execution. Note that stream scheduling might need special attention for good overlapping; Tesla and Quadro cards have dual-DMA engine which enables simultaneous data transfer to and from GPU. Additionally, with CUDA 4.0 it became easier to use a GPU from multiple CPU threads, so in a multi-threaded CPU code each thread can send its own data to the GPU and launch kernels easier.

Finally, GMAC implements an asymmetric shared memory model for CUDA. One of its very interesting features is the coherency models it provides, in particular lazy- and rolling update enabling the transfer of only data modified on the CPU in a blocked fashion.
For more details see the following paper: Gelado et al. - An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems.

Techniques to Reduce CPU to GPU Data Transfer Latency

3 Answers