GPU is really fast when it comes to paralleled computation and out performs CPU with being 15-30 ( some have reported even 50 ) times faster however, GPU memory is very limited compared to CPU memory and communication between GPU memory and CPU is not as fast.
Lets say we have some data what won't fit into GPU ram but we still want to use it's wonders to compute. What we can do is split that data into pieces and feed it into GPU one by one.
Sending large data to GPU can take time and one might think, what if we would split a data piece into two and feed the first half, run the kernel and then feed the other half while kernel is running.
By that logic we should save some time because data transfer should be going on while computation is, hopefully not interrupting it's job and when finished, it can just, well, continue it's job without needs for waiting a new data path.
I must say that I'm new to gpgpu, new to cuda but I have been experimenting around with simple cuda codes and have noticed that the function cudaMemcpy used to transfer data between CPU and GPU will block if kerner is running. It will wait until kernel is finished and then will do its job.
My question, is it possible to accomplish something like that described above and if so, could one show an example or provide some information source of how it could be done?
Thank you!