9
votes

Scenario:

I have two machines, a client and a server, connected with InfiniBand. The server machine has an NVIDIA Fermi GPU, but the client machine has no GPU. I have an application running on the GPU machine that uses the GPU for some calculations. The result data on the GPU is never used by the server machine itself, but is instead sent directly to the client machine without any processing. Right now I'm doing a cudaMemcpy to get the data from the GPU to the server's system memory, then sending it off to the client over a socket. I'm using SDP to enable RDMA for this communication.
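For reference, the current data path looks roughly like this (a minimal sketch; the buffer name, element count, and socket descriptor are placeholders, and error handling is omitted):

```cpp
#include <cuda_runtime.h>
#include <sys/socket.h>
#include <cstdlib>

// Sketch of the current path: device memory -> host staging buffer -> socket.
// 'dev_result' and 'client_fd' are assumed to be set up elsewhere.
void send_result(const float *dev_result, size_t n, int client_fd) {
    float *host_buf = (float *)malloc(n * sizeof(float));

    // This is the copy I'd like to eliminate: GPU -> system memory.
    cudaMemcpy(host_buf, dev_result, n * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Hand the data to the SDP socket; RDMA happens below this call.
    send(client_fd, host_buf, n * sizeof(float), 0);

    free(host_buf);
}
```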

Question:

Is it possible for me to take advantage of NVIDIA's GPUDirect technology to get rid of the cudaMemcpy call in this situation? I believe I have the GPUDirect drivers correctly installed, but I don't know how to initiate the data transfer without first copying it to the host.

My guess is that it isn't possible to use SDP in conjunction with GPUDirect, but is there some other way to initiate an RDMA data transfer from the server machine's GPU to the client machine?

Bonus: If someone has a simple way to test whether I have the GPUDirect dependencies correctly installed, that would be helpful as well!

1
In the CUDA SDK code samples you can find sample code that demonstrates what you want: developer.nvidia.com/cuda/cuda-cc-sdk-code-samples. You would need to use cudaMemcpyAsync to copy asynchronously with respect to the host. - Sayan
I have the CUDA SDK, but I don't see any examples using GPUDirect technology. Do you know of a specific sample program I should look at? - DaoWen
I don't currently have it downloaded, but I think the "Simple Peer-to-Peer Transfers with Multi-GPU" example in the link I gave is what you want. - Sayan
I'll go take a look at that and post back if I'm wrong, but I'm not looking for GPU-to-GPU (P2P) transfers. I'm pretty sure I can do that with the normal cudaMemcpy call. What I'm looking for is a way to transfer directly from the GPU to memory on another host using RDMA and Infiniband. - DaoWen
Okay, in that case you would definitely need to use pinned memory (allocated via cudaMallocHost), or register existing memory with the cudaHostRegister function. I guess you just have to pin the memory, and GPUDirect will enable the RDMA transfer if the setup is okay (if your throughput after doing this is better than what you get currently, then you can be certain of the improvement). And as far as I know, GPUDirect only accelerates cudaMemcpy; the copy itself cannot be removed. If you have many memcpy calls (H2D, D2H), then you could just use cudaMemcpyDefault. - Sayan
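The pinned-memory setup described in the comments above could look like the following sketch (both variants shown; the buffer names and sizes are hypothetical, and error checking is omitted):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

// Two ways to obtain pinned (page-locked) host memory, as suggested above.
// Pinned memory is required for the GPU and the InfiniBand driver to share
// a buffer, which is what lets GPUDirect avoid an extra host-side copy.
void pinned_memory_sketch(size_t bytes) {
    // Option 1: allocate pinned memory directly.
    void *buf1 = nullptr;
    cudaMallocHost(&buf1, bytes);

    // Option 2: pin an existing heap allocation after the fact.
    void *buf2 = malloc(bytes);
    cudaHostRegister(buf2, bytes, cudaHostRegisterDefault);

    // ... use the buffers with cudaMemcpyAsync and the SDP socket ...

    cudaHostUnregister(buf2);
    free(buf2);
    cudaFreeHost(buf1);
}
```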

1 Answer

4
votes

GPUDirect RDMA is a new feature that will be implemented in cooperation with NVIDIA's InfiniBand partners. It was announced with CUDA 5.0, but it is not yet available. Watch the GPUDirect page for updates.