I've been looking into ways to reduce the latency caused by transferring data back and forth from the CPU and GPU. When I first started using CUDA I did notice that data transfer between the CPU and GPU did take a few seconds, but I didn't really care because this isn't really a concern for the small programs I'm been writing. In fact, the latency probably isn't much of a problem for vast majority of the programs that utilize GPUs, video games included, because they're still a lot faster than if they would have run on the CPU.
However, I'm a bit of an HPC enthusiast and I became concerned with the direction of my studies when I saw the massive discrepancy between the Tianhe-I theoretical peak FLOPS and the actual LINPACK measured performance. This has raised my concerns about whether I'm taking the right career path.
Use of pinned memory (page-locked) memory through the use of the cudaHostAlloc() function is one method of reducing latency (quite effective), but are there any other techniques I'm not aware of? And to be clear, I'm talking about optimizing the code, not the hardware itself (that's NVIDIA and AMD's jobs).
Just as a side question, I'm aware that Dell and HP sell Tesla servers. I'm curious as to how well a GPU leverages a database application, where you would need a constant read from the hard drive (HDD or SSD), an operation only the CPU can perform,