I have some questions.
I'm currently writing a program using CUDA.
In my program, there is one large piece of data on the host, stored in a std::map<std::string, std::vector<int>>.
From this map, some of the std::vector<int> data is copied to the GPU's global memory and processed on the GPU.
After processing, results are generated on the GPU and copied back to the CPU.
This is the overall schedule of my program (a minimal sketch follows the list):
1. cudaMemcpy(..., cudaMemcpyHostToDevice)
2. kernel function (the kernel can only run once the necessary data has been copied to GPU global memory)
3. cudaMemcpy(..., cudaMemcpyDeviceToHost)
4. repeat steps 1-3 1000 times (once for each data vector)
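In code, this first version looks roughly like the following. This is only a minimal sketch, not my real code: the process kernel is a dummy stand-in, and d_in, d_out, and h_out are placeholder names for buffers assumed to be allocated elsewhere and large enough for the biggest vector.

```cpp
#include <cuda_runtime.h>
#include <map>
#include <string>
#include <vector>

// Placeholder kernel standing in for my real processing.
__global__ void process(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

// d_in/d_out are device buffers, h_out is a host result buffer.
void runSync(const std::map<std::string, std::vector<int>>& hostMap,
             int* d_in, int* d_out, int* h_out) {
    for (const auto& entry : hostMap) {           // ~1000 iterations
        const std::vector<int>& v = entry.second;
        size_t bytes = v.size() * sizeof(int);

        // step 1: blocking host -> device copy
        cudaMemcpy(d_in, v.data(), bytes, cudaMemcpyHostToDevice);

        // step 2: kernel runs once the data is in global memory
        process<<<((int)v.size() + 255) / 256, 256>>>(d_in, d_out, (int)v.size());

        // step 3: blocking device -> host copy of the results
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }
}
```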
But I want to reduce the processing time, so I decided to use the cudaMemcpyAsync function in my program.
After reading some documentation and web pages, I learned that for cudaMemcpyAsync to actually be asynchronous, the host memory holding the data to be copied to GPU global memory must be allocated as pinned (page-locked) memory.
My program keeps its data in a std::map, and I couldn't make the std::map's own storage pinned.
So instead I allocated a separate pinned-memory buffer array, sized so that it can hold any of the vectors I need to copy.
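Concretely, the buffer is allocated once with cudaMallocHost, something like this (maxBytes is just a placeholder for the size of the largest data I ever need to stage):

```cpp
// Pinned (page-locked) staging buffer; maxBytes is a placeholder.
int* pinnedBuf = nullptr;
cudaMallocHost(&pinnedBuf, maxBytes);   // page-locked, so the GPU can DMA from it
// ... reuse pinnedBuf for every copy ...
cudaFreeHost(pinnedBuf);
```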
In the end, my program works like this (sketched below the list):
1. memcpy (copy data from the std::map into the pinned buffer in a loop, until all of the data is in the buffer)
2. cudaMemcpyAsync(..., cudaMemcpyHostToDevice)
3. kernel (the kernel can only run once all of the data has been copied to GPU global memory)
4. cudaMemcpyAsync(..., cudaMemcpyDeviceToHost)
5. repeat steps 1-4 1000 times (once for each data set)
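Roughly, the loop body for one data set looks like this. Again a sketch, not my real code: it reuses the process kernel, d_in/d_out/h_out, and pinnedBuf from above, and 'chunks' is a placeholder list of pointers to the vectors that make up one data set. (For the device-to-host copy to be truly asynchronous, h_out would also need to be pinned.)

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

// step 1: stage ALL vectors of this data set into the pinned buffer first
size_t offset = 0;                                // offset in int elements
for (const std::vector<int>* v : chunks) {
    memcpy(pinnedBuf + offset, v->data(), v->size() * sizeof(int));
    offset += v->size();
}

// step 2: one async host -> device copy of the whole staged block;
// the call returns before the transfer has finished
cudaMemcpyAsync(d_in, pinnedBuf, offset * sizeof(int),
                cudaMemcpyHostToDevice, stream);

// step 3: kernel in the same stream, so it starts only after the copy is done
process<<<((int)offset + 255) / 256, 256, 0, stream>>>(d_in, d_out, (int)offset);

// step 4: async device -> host copy of the results
cudaMemcpyAsync(h_out, d_out, offset * sizeof(int),
                cudaMemcpyDeviceToHost, stream);

// wait before reusing pinnedBuf and h_out (steps 1-4 repeat 1000 times)
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```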
This version of the program is much faster than the previous one.
But here is where my problem (really, my curiosity) starts.
I tried writing another program in a similar way (sketched below the list):
1. memcpy (copy just one vector at a time from the std::map into the pinned buffer)
2. cudaMemcpyAsync(..., cudaMemcpyHostToDevice)
3. loop steps 1-2 until all of the data has been copied to GPU global memory
4. kernel (the kernel can only run once the necessary data has been copied to GPU global memory)
5. cudaMemcpyAsync(..., cudaMemcpyDeviceToHost)
6. repeat steps 1-5 1000 times (once for each data set)
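Sketched with the same placeholder names as above, the loop body becomes:

```cpp
size_t offset = 0;                                // offset in int elements
for (const std::vector<int>* v : chunks) {
    size_t bytes = v->size() * sizeof(int);

    // step 1: stage just this ONE vector into its slice of the pinned buffer
    memcpy(pinnedBuf + offset, v->data(), bytes);

    // step 2: enqueue its async host -> device copy; cudaMemcpyAsync returns
    // immediately, so the next iteration's memcpy runs on the CPU while this
    // transfer may still be in flight
    cudaMemcpyAsync(d_in + offset, pinnedBuf + offset, bytes,
                    cudaMemcpyHostToDevice, stream);

    offset += v->size();                          // step 3: loop over all vectors
}

// step 4: the kernel, issued in the same stream, starts only after every
// enqueued copy has finished
process<<<((int)offset + 255) / 256, 256, 0, stream>>>(d_in, d_out, (int)offset);

// step 5: async device -> host copy of the results
cudaMemcpyAsync(h_out, d_out, offset * sizeof(int),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);                    // steps 1-5 repeat 1000 times
```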
This method turned out to be about 10% faster than the method above, and I don't know why.
I thought cudaMemcpyAsync could only be overlapped with kernel execution.
But in my case that doesn't seem to be what's happening; rather, it looks as if the cudaMemcpyAsync calls can overlap with each other.
Sorry for the long question, but I really want to understand why.
Can someone explain to me what exactly cudaMemcpyAsync does, and which operations can be overlapped with cudaMemcpyAsync?