3 votes

GPUs are very fast at parallel computation and can outperform CPUs by roughly 15-30 times (some have reported even 50); however, GPU memory is very limited compared to CPU memory, and communication between GPU memory and the CPU is not as fast.

Let's say we have some data that won't fit into GPU RAM, but we still want to use the GPU's power to compute on it. What we can do is split that data into pieces and feed them to the GPU one by one.

Sending large data to the GPU can take time, and one might think: what if we split a piece of data in two, feed the first half, run the kernel on it, and then feed the other half while the kernel is running?

By that logic we should save some time, because the data transfer would happen while the computation is going on, hopefully without interrupting it, and when the kernel finishes it can simply continue its work without waiting for a new batch of data.

I must say that I'm new to GPGPU and new to CUDA, but I have been experimenting with simple CUDA code and have noticed that the function cudaMemcpy, used to transfer data between the CPU and the GPU, will block if a kernel is running. It will wait until the kernel is finished and only then do its job.
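
To illustrate what I mean, here is a simplified sketch of the kind of experiment I've been running (the kernel doubleElements and the sizes are just placeholders):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    // Placeholder kernel: just doubles each element of a chunk.
    __global__ void doubleElements(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main(void)
    {
        const int N = 1 << 20, half = N / 2;
        float *h = (float *)malloc(N * sizeof(float));
        for (int i = 0; i < N; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc((void **)&d, N * sizeof(float));

        // Copy the first half and start processing it.
        cudaMemcpy(d, h, half * sizeof(float), cudaMemcpyHostToDevice);
        doubleElements<<<(half + 255) / 256, 256>>>(d, half);

        // Copy the second half. Even though the kernel above may still be
        // running, this cudaMemcpy uses the default stream, so it waits for
        // the kernel to finish before the transfer starts - no overlap.
        cudaMemcpy(d + half, h + half, half * sizeof(float), cudaMemcpyHostToDevice);
        doubleElements<<<(half + 255) / 256, 256>>>(d + half, half);

        cudaDeviceSynchronize();
        cudaFree(d);
        free(h);
        return 0;
    }

The second copy does not start until the first kernel has completed, which is the blocking behaviour I'm describing.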


My question: is it possible to accomplish something like what is described above, and if so, could someone show an example or point me to a source of information on how it could be done?

Thank you!

1
I need help coming up with a better title. Would be really grateful, thank you! - Aiden Anomaly

1 Answer

7 votes

is it possible to accomplish something like what is described above

Yes, it's possible. What you're describing is a pipelined algorithm, and CUDA has various asynchronous capabilities to enable it.

The asynchronous concurrent execution section of the programming guide covers the necessary elements in CUDA to make it work. To use your example, there exists a non-blocking version of cudaMemcpy, called cudaMemcpyAsync. You'll need to understand CUDA streams and how to use them.
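
As a rough sketch of the basic pattern (this is not the linked worked example; the kernel processChunk, the chunk size, and the number of streams are arbitrary placeholders), it looks something like this. Note that cudaMemcpyAsync needs pinned (page-locked) host memory to overlap with kernel execution:

    #include <cuda_runtime.h>

    // Placeholder kernel operating on one chunk.
    __global__ void processChunk(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main(void)
    {
        const int nChunks = 4;
        const int chunkSize = 1 << 20;

        // Pinned host memory is required for cudaMemcpyAsync to overlap
        // with kernel execution.
        float *h;
        cudaHostAlloc((void **)&h, nChunks * chunkSize * sizeof(float),
                      cudaHostAllocDefault);

        float *d;
        cudaMalloc((void **)&d, nChunks * chunkSize * sizeof(float));

        cudaStream_t streams[nChunks];
        for (int i = 0; i < nChunks; ++i)
            cudaStreamCreate(&streams[i]);

        // Issue copy -> kernel -> copy-back for each chunk in its own stream.
        // Work in different streams can overlap: while one chunk's kernel is
        // running, the next chunk's host-to-device copy can be in flight.
        for (int i = 0; i < nChunks; ++i) {
            size_t off = (size_t)i * chunkSize;
            cudaMemcpyAsync(d + off, h + off, chunkSize * sizeof(float),
                            cudaMemcpyHostToDevice, streams[i]);
            processChunk<<<(chunkSize + 255) / 256, 256, 0, streams[i]>>>(d + off, chunkSize);
            cudaMemcpyAsync(h + off, d + off, chunkSize * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[i]);
        }

        cudaDeviceSynchronize();  // wait for all streams to finish

        for (int i = 0; i < nChunks; ++i)
            cudaStreamDestroy(streams[i]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }

Whether copies and kernels actually overlap also depends on the device having at least one copy engine (see the asyncEngineCount device property), so it's worth confirming the overlap on your hardware, for example with the profiler.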

I would also suggest this presentation, which covers most of what is needed.

Finally, here is a worked example. That particular example happens to use CUDA stream callbacks, but those are not necessary for basic pipelining. They enable additional host-oriented processing to be asynchronously triggered at various points in the pipeline, but the basic chunking of data, and delivery of data while processing is occurring does not depend on stream callbacks. Note also the linked CUDA sample codes in that answer, which may be useful for study/learning.