What is the fastest way to move data that is on the device around in CUDA?
What I need to do is copy contiguous sub-rows and sub-columns (whose indexes I have on the device) from row-major matrices into new, smaller matrices. From what I've observed, though, memory access in CUDA is not particularly efficient; the cores seem to be optimized for computation rather than for moving memory around.
Now the CPU seems to be pretty good at sequential work like moving rows of aligned memory from one place to another.
I see three options:
- make a kernel that does the memory copying
- outside a kernel, call cudaMemcpy(.., device to device) for each position (terribly slow for columns I would guess)
- move the memory to the host, create the new smaller matrix and send it back to the device
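For reference, the second option does not have to mean one `cudaMemcpy` call per element. A minimal sketch, assuming `float` matrices and that the row/column indexes are available on the host at call time (in my case they live on the device, so they would have to be copied back first): `cudaMemcpy2D` can copy a strided sub-column in a single device-to-device call. The helper names `copySubColumn` and `subColumn` are made up for illustration, and error checking is omitted. Whether this beats a hand-written kernel would still need to be measured.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Copy `height` elements of column `col`, starting at row `row0`, from a
// row-major device matrix with `cols` columns into a contiguous device buffer.
cudaError_t copySubColumn(float* d_dst, const float* d_A, int cols,
                          int row0, int col, int height)
{
    return cudaMemcpy2D(d_dst, sizeof(float),             // destination, destination pitch in bytes
                        d_A + (size_t)row0 * cols + col,  // first source element
                        (size_t)cols * sizeof(float),     // source pitch in bytes (one full row)
                        sizeof(float), height,            // width in bytes, number of rows
                        cudaMemcpyDeviceToDevice);
}

// Host-side helper (hypothetical) that uploads A, extracts the sub-column
// on the device, and downloads the result so it can be inspected.
std::vector<float> subColumn(const std::vector<float>& A, int cols,
                             int row0, int col, int height)
{
    float* d_A = nullptr;
    float* d_col = nullptr;
    cudaMalloc(&d_A, A.size() * sizeof(float));
    cudaMalloc(&d_col, height * sizeof(float));
    cudaMemcpy(d_A, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);

    copySubColumn(d_col, d_A, cols, row0, col, height);

    std::vector<float> out(height);
    cudaMemcpy(out.data(), d_col, height * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_col);
    return out;
}
```

A contiguous sub-row needs no 2D copy at all: a single `cudaMemcpy` with `cudaMemcpyDeviceToDevice` over `len * sizeof(float)` bytes covers it.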
Now I could test this on my specific GPU, but given its specs I don't think it would be representative. In general, what is recommended?
Edit:
I'm essentially multiplying two matrices A and B, but I'm only interested in multiplying the X elements:
A =[[XX XX]
[ XX XX ]
[XX XX ]]
with the corresponding elements in the columns of B. The XX are always of the same length and I know their positions (and there's a fixed number of them per row).
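To make the first option concrete for this layout, here is a minimal sketch of a gather kernel. It assumes `float` data, that every row has `segs` segments of length `len`, and that a device array `starts` holds the starting column of each segment (`starts[r * segs + s]` for segment `s` of row `r`). These names, and the host helper `packSegments`, are hypothetical and only for illustration; error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One thread per output element. A is rows x cols, row-major; packed is
// rows x (segs * len), row-major, holding only the selected segments.
__global__ void gatherSegments(const float* A, const int* starts, float* packed,
                               int rows, int cols, int segs, int len)
{
    int outCols = segs * len;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * outCols) return;

    int r = idx / outCols;   // output (and input) row
    int j = idx % outCols;   // column inside the packed row
    int s = j / len;         // which segment of this row
    int k = j % len;         // offset inside that segment

    // Writes are contiguous across threads; reads are contiguous within a segment.
    packed[idx] = A[r * cols + starts[r * segs + s] + k];
}

// Host-side helper (hypothetical) to run the kernel and fetch the result.
std::vector<float> packSegments(const std::vector<float>& A,
                                const std::vector<int>& starts,
                                int rows, int cols, int segs, int len)
{
    int outSize = rows * segs * len;
    float* d_A = nullptr;
    float* d_packed = nullptr;
    int* d_starts = nullptr;
    cudaMalloc(&d_A, A.size() * sizeof(float));
    cudaMalloc(&d_starts, starts.size() * sizeof(int));
    cudaMalloc(&d_packed, outSize * sizeof(float));
    cudaMemcpy(d_A, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_starts, starts.data(), starts.size() * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (outSize + threads - 1) / threads;
    gatherSegments<<<blocks, threads>>>(d_A, d_starts, d_packed, rows, cols, segs, len);

    std::vector<float> out(outSize);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_packed, outSize * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_starts);
    cudaFree(d_packed);
    return out;
}
```

The same idea should work for the columns of B with the index arithmetic swapped, although column reads from a row-major matrix are strided and will likely coalesce worse.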
lda, ldb, ldc arguments in addition to m, n, k) and can even do an implicit transpose of the source matrices. It is not clear exactly how you are constructing the input matrices or how big they are; maybe even CUSPARSE would be applicable. It would be helpful if you could show code that demonstrates what you are doing; otherwise the question as-is appears too broad and will invite hand-wavy opinions rather than a solid answer. - njuffa