4 votes

If I transfer a single byte from a CUDA kernel over PCI-E to the host (zero-copy memory), how much slower is it compared to transferring something like 200 megabytes?

What I would like to know, since I know that transferring over PCI-E is slow for a CUDA kernel, is: does it make any difference whether I transfer just a single byte or a huge amount of data? Or, since memory transfers are performed in bulk, is transferring a single byte extremely expensive and wasteful compared to transferring 200 MB?
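For reference, here is a minimal sketch of the zero-copy setup the question describes: host memory mapped into the device address space, with a kernel writing a single byte straight over PCI-E. The kernel and variable names are illustrative assumptions; the measurements discussed below come from explicit bandwidthTest-style copies instead.

```
// Minimal sketch (assumption: a device that supports mapped host memory).
// A kernel writes one byte directly into host memory over PCI-E.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void writeFlag(volatile unsigned char *hostFlag)
{
    *hostFlag = 1;   // this store travels across PCI-E to the host
}

int main()
{
    unsigned char *h_flag = nullptr, *d_flag = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);                   // enable mapped (zero-copy) memory
    cudaHostAlloc((void**)&h_flag, 1, cudaHostAllocMapped);  // pinned + mapped host byte
    cudaHostGetDevicePointer((void**)&d_flag, h_flag, 0);    // device-side alias of h_flag

    writeFlag<<<1, 1>>>(d_flag);
    cudaDeviceSynchronize();

    printf("flag = %d\n", *h_flag);
    cudaFreeHost(h_flag);
    return 0;
}
```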

The bandwidth test example which has shipped with CUDA forever is specifically designed to answer this question. – talonmies
I don't have a CUDA GPU right now; can you give me a hint about the results? – Marco A.
This has to do with the overhead of launching a transfer request. For example, 200 transfers of 1 MB each will be slower than a single 200 MB transfer. – Pavan Yalamanchili
If you have a large amount of data to transfer to the GPU for processing, it's best to look into the following concepts: 1) streams and 2) async copy. Here is code for checking the bandwidth that you might want to look at. – Sagar Masuti
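To make the last comment concrete, here is a minimal sketch of the streams + async copy idea: split a large buffer into chunks and overlap host-to-device copies with kernel work. The chunk count, buffer size and the placeholder kernel are assumptions, not code from the comment.

```
// Sketch: overlap chunked H2D copies with kernel work using CUDA streams.
// Pinned host memory is required for cudaMemcpyAsync to be truly asynchronous.
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                 // placeholder work
}

int main()
{
    const int N = 1 << 24;                      // total elements (illustrative)
    const int NSTREAMS = 4;
    const int CHUNK = N / NSTREAMS;

    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, N * sizeof(float));
    cudaMalloc((void**)&d_data, N * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAMS; ++s) {
        int offset = s * CHUNK;
        // The copy of chunk s and the kernel that consumes it go into the same
        // stream, so copies for later chunks overlap with earlier kernels.
        cudaMemcpyAsync(d_data + offset, h_data + offset,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice,
                        streams[s]);
        process<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, CHUNK);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```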

2 Answers

7 votes

I hope this picture explains everything. The data was generated by the bandwidthTest example in the CUDA samples. The hardware environment is PCI-E v2.0, a Tesla M2090 and 2x Xeon E5-2609. Please note that both axes are on a log scale.

Given this figure, we can see that the overhead of launching a transfer request takes a constant amount of time. Regression analysis on the data gives an estimated overhead of 4.9 µs for H2D, 3.3 µs for D2H and 3.0 µs for D2D.

[Figure: transfer time vs. transfer size from bandwidthTest (H2D, D2H, D2D), both axes on a log scale]
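Read the other way, those numbers give a rough cost model for a transfer: a fixed per-call overhead plus size divided by sustained bandwidth. A minimal sketch; the ~6 GB/s sustained bandwidth is an assumed ballpark for PCI-E 2.0 x16, not a figure from this answer.

```
// Rough cost model implied by the regression: t(n) = overhead + n / bandwidth.
// The 6 GB/s sustained bandwidth is an assumed ballpark, not measured data.
#include <cstdio>

double h2dTimeUs(double bytes)
{
    const double overheadUs = 4.9;    // H2D overhead from the regression above
    const double bandwidth  = 6.0e9;  // assumed sustained bytes/second
    return overheadUs + bytes / bandwidth * 1e6;
}

int main()
{
    printf("1 byte : %.1f us\n", h2dTimeUs(1.0));                   // dominated by overhead
    printf("200 MB : %.1f us\n", h2dTimeUs(200.0 * 1024 * 1024));   // dominated by bandwidth
    return 0;
}
```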

-1 votes

A latency plot would be clearer in this case. Small transactions aren't more expensive than big ones; the only problem with them is that they can't saturate the bus, so larger messages can be transferred in almost the same time. That is why transferring one 512 KB block is about 120 times faster than performing 512 separate 1 KB transactions. The saturation point of PCI-E depends on the lane count. You can find more details about PCI-E features from a CUDA point of view here.
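A minimal sketch of the comparison described above, timing one 512 KB copy against 512 separate 1 KB copies; the exact ratio you measure will depend on the hardware, so the 120x figure should be taken as indicative.

```
// Sketch: one 512 KB transfer vs. 512 separate 1 KB transfers (H2D).
// Buffer sizes follow the answer; pinned memory is an illustrative choice.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t KB = 1024;
    char *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, 512 * KB);
    cudaMalloc((void**)&d_buf, 512 * KB);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // One 512 KB transfer: pays the launch overhead once.
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, 512 * KB, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("1 x 512 KB : %.3f ms\n", ms);

    // 512 separate 1 KB transfers: each one pays the overhead again.
    cudaEventRecord(start);
    for (int i = 0; i < 512; ++i)
        cudaMemcpy(d_buf + i * KB, h_buf + i * KB, KB, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("512 x 1 KB : %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```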