4
votes

With CUDA SDK 5.5 I can copy data in the following ways:

  • from the host: cudaMemcpy(), which uses GPU-DMA if the memory is pinned
  • from the host: memcpy() or cudaMemcpy(), which uses the CPU cores if the memory isn't pinned
  • from the GPU: for() { dst[i] = src[i]; } or memcpy(), which uses the GPU cores
  • from the GPU: using GPU-DMA ???

How can I use GPU-DMA from within a kernel function in CUDA code to copy data?
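
For reference, a minimal sketch of the first and third copy paths from the list above (buffer names and sizes are only illustrative, error checking omitted):

    #include <cuda_runtime.h>

    // In-kernel copy using the GPU cores (third item in the list).
    __global__ void copyKernel(float *dst, const float *src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            dst[i] = src[i];
    }

    int main()
    {
        const int N = 1 << 20;

        // Pinned host buffer: cudaMemcpy from it can use the GPU's DMA engine.
        float *h_src;
        cudaMallocHost(&h_src, N * sizeof(float));

        float *d_a, *d_b;
        cudaMalloc(&d_a, N * sizeof(float));
        cudaMalloc(&d_b, N * sizeof(float));

        // Host -> device copy (first item in the list).
        cudaMemcpy(d_a, h_src, N * sizeof(float), cudaMemcpyHostToDevice);

        // Device-side copy executed by the GPU cores.
        copyKernel<<<(N + 255) / 256, 256>>>(d_b, d_a, N);
        cudaDeviceSynchronize();

        cudaFree(d_a); cudaFree(d_b); cudaFreeHost(h_src);
        return 0;
    }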

1
What exactly is your problem? Accessing the memory space of one GPU from another GPU? In that case, you can use Peer-to-Peer (P2P) GPU memory copies; have a look at the simpleP2P CUDA SDK example. - Vitality
@JackOLantern The problem is that for Peer-to-Peer (P2P) GPU memory copies I need to use cudaMemcpy(), but I can't call it from a kernel function in CUDA code, as stated here: on-demand.gputechconf.com/gtc-express/2011/presentations/… I want to initiate an asynchronous copy in advance (as a prefetch) from remote GPU-RAM to the current GPU's RAM, to reduce latency by accessing the current GPU's RAM instead of the remote GPU's RAM. - Alex
If your system supports UVA, then you can address one GPU's memory space from another GPU inside a kernel function, e.g. dst[i] = src[i];. - Vitality
@JackOLantern Yes, that is the third point in my question, but the bandwidth is much lower than access to the current GPU's global memory. - Alex

1 Answer

4
votes

What you are trying to do is not possible from the device side unless your GPU supports compute capability 3.5. If you have such a card, see the edit below.

Yes, you can access the RAM of another GPU by passing a device pointer allocated on that device to your kernel. The runtime will then deliver the requested data to the right GPU. However, this isn't very efficient, because every access to another device's memory results in a memcopy operation, either peer-to-peer or device-host-device.
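
A minimal sketch of this approach, assuming two P2P/UVA-capable GPUs (device 0 and device 1) on the same system; buffer names are placeholders and error checks are omitted:

    #include <cuda_runtime.h>

    __global__ void readRemote(float *local, const float *remote, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            local[i] = remote[i];   // each access goes over the bus to the peer GPU
    }

    int main()
    {
        const int N = 1 << 20;

        // Buffer on GPU 1 that GPU 0 will read directly.
        float *d_remote;
        cudaSetDevice(1);
        cudaMalloc(&d_remote, N * sizeof(float));

        // Buffer on GPU 0, plus peer access to GPU 1.
        float *d_local;
        cudaSetDevice(0);
        cudaMalloc(&d_local, N * sizeof(float));
        cudaDeviceEnablePeerAccess(1, 0);

        // The kernel running on GPU 0 dereferences the pointer that lives on GPU 1.
        readRemote<<<(N + 255) / 256, 256>>>(d_local, d_remote, N);
        cudaDeviceSynchronize();

        cudaFree(d_local);
        cudaSetDevice(1);
        cudaFree(d_remote);
        return 0;
    }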

What you can do is prefetch the data from your host code and use different streams for your memcopy operations (cudaMemcpyAsync) and your kernel executions. However, this works only if you have a decent card with a separate copy engine, and you have to synchronize explicitly, because there is no built-in mechanism that holds your kernel back until the data transfer is finished.
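
One way to do that explicit synchronization is with a CUDA event between a copy stream and a compute stream, roughly like this (d_local, d_remote, N and myKernel are placeholders):

    // The copy stream prefetches from the peer GPU; the compute stream's kernel
    // is held back by an event until the transfer has finished.
    cudaStream_t copyStream, computeStream;
    cudaEvent_t  copyDone;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);
    cudaEventCreate(&copyDone);

    // Asynchronous peer-to-peer prefetch (GPU 1 -> GPU 0) on the copy stream.
    cudaMemcpyPeerAsync(d_local, 0, d_remote, 1, N * sizeof(float), copyStream);
    cudaEventRecord(copyDone, copyStream);

    // The kernel only starts once the prefetched data is in local GPU memory.
    cudaStreamWaitEvent(computeStream, copyDone, 0);
    myKernel<<<(N + 255) / 256, 256, 0, computeStream>>>(d_local, N);

    cudaStreamSynchronize(computeStream);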

EDIT:

If you have a compute capability 3.5 device, you can use the CUDA device runtime for device-to-device memcopy within your device code. See the dynamic parallelism documentation here: http://docs.nvidia.com/cuda/pdf/cuda_dynamic_parallelism_programming_guide.pdf Note that all memcopy operations on the device are asynchronous, so you will again have to preserve data coherence on your own.
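
A minimal sketch of such a device-side copy, assuming a sm_35 target compiled with -rdc=true and linked against cudadevrt (the device runtime only permits asynchronous device-to-device copies):

    __global__ void prefetchKernel(float *dst, const float *src, size_t bytes)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0)
        {
            // Issued by the GPU itself via the device runtime; always asynchronous.
            cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice);
            // Nothing preserves coherence for you: synchronize before using dst.
            cudaDeviceSynchronize();
        }
    }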