Moving memory around on device in CUDA

Question

What is the fastest way to move data that is on the device around in CUDA?

What I need to do is copy contiguous sub-rows and sub-columns (whose indices I have on the device) from row-major matrices into new, smaller matrices. From what I've observed, memory access in CUDA is not particularly efficient, as the cores seem to be optimized for computation rather than memory operations.

Now the CPU seems to be pretty good at doing sequential stuff like moving rows of aligned memory from one place to another.
I see three options:

make a kernel that does the memory copying
outside a kernel, call cudaMemcpy(.., device to device) for each position (terribly slow for columns I would guess)
move the memory to the host, create the new smaller matrix and send it back to the device

Now I could test this on my specific GPU, but given its specs I don't think it would be representative. In general, what is recommended?

Edit:

I'm essentially multiplying two matrices A,B but I'm only interested in multiplying the X elements:

A =[[XX      XX]
    [  XX  XX  ]
    [XX  XX    ]]

with the corresponding elements in the columns of B. The XX are always of the same length and I know their positions (and there's a fixed number of them per row).

Regardless of platform, it is usually best to avoid moving data around. Moving data mostly consumes a lot of energy. It is better to incorporate the data movement into the actual processing. For example, CUBLAS API functions typically allow programmers to specify individual (sub-)vectors and sub-matrices. Most functions taking vectors also allow you to specify a stride between vector elements. If you can't find CUBLAS functions that meet your needs, pull the data from the appropriate locations during your own custom processing. Regular access patterns will lead to high memory bandwidth (100s of GB/sec) — njuffa
Well the ultimate goal is to do a gemm of the resulting matrices, and since I don't think I have the skills to rewrite such an operation using only the required subparts of the matrices, I feel like it would be better to copy and then gemm rather than do my own inefficient matrix multiply. — Manux
The ?GEMM functions in CUBLAS can all operate on submatrices (that is why there are lda, ldb, ldc arguments in addition to m, n, k) and can even do an implicit transpose of the source matrices. It is not clear how exactly you are constructing the input matrices or how big they are; maybe even CUSPARSE would be applicable. It would be helpful if you could show code that demonstrates what you are doing, otherwise the question as-is appears too broad and will invite handwavy opinions rather than a solid answer. — njuffa
It is almost never necessary to copy data in the way you are asking about. Pointer arithmetic and exchange covers 95% of real world cases when working with dense matrices and BLAS/LAPACK style linear algebra libraries. — talonmies
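
A minimal sketch of the point made in the comments above, for the case where the needed data forms a single rectangular block: the block can be handed to GEMM without any copy by offsetting the pointer and passing the parent matrix's row length as the leading dimension. The helper name gemm_subblocks_row_major and its parameters are hypothetical and not from the thread. It assumes row-major float matrices already on the device, and it swaps the operands because cuBLAS itself is column-major.

```cuda
#include <cassert>
#include <cstddef>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper (illustration only, not part of cuBLAS).
// Computes C = A_sub * B_sub for row-major device matrices, where
//   A_sub is the m x k block of A whose top-left element is (a_row, a_col),
//   B_sub is the k x n block of B whose top-left element is (b_row, b_col),
//   lda / ldb / ldc are the row lengths of the full A, B and C arrays.
// Nothing is copied: only the base pointers are offset.
cublasStatus_t gemm_subblocks_row_major(cublasHandle_t handle,
                                        int m, int n, int k,
                                        const float* A, int lda, int a_row, int a_col,
                                        const float* B, int ldb, int b_row, int b_col,
                                        float* C, int ldc)
{
    // The blocks must fit inside one row of their parent matrices.
    assert(lda >= a_col + k);
    assert(ldb >= b_col + n);
    assert(ldc >= n);

    const float alpha = 1.0f;
    const float beta = 0.0f;
    const float* A_sub = A + (size_t)a_row * lda + a_col;
    const float* B_sub = B + (size_t)b_row * ldb + b_col;

    // A row-major matrix X occupies memory exactly like the column-major X^T.
    // So we ask cuBLAS for C^T = B_sub^T * A_sub^T, which means passing
    // B first, A second, and swapping m and n.
    return cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, m, k,
                       &alpha,
                       B_sub, ldb,
                       A_sub, lda,
                       &beta,
                       C, ldc);
}
```

This only covers blocks with a constant stride between rows. When the position of the wanted elements changes from row to row, as in the layout shown in the question, a leading dimension cannot describe the pattern.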

Robert Crovella · Accepted Answer · 2015-03-26T20:30:03

If you have a matrix storage pattern that involves varying spacing between corresponding row elements (or corresponding column elements), none of the input transformation or striding capabilities of cublas will help, and none of the api striding-copy functions (such as cudaMemcpy2D) will help.

You'll need to write your own kernel to gather the data before feeding it to cublasXgemm. This should be fairly straightforward if the locations of the incoming data elements are available in a vector or some other list on the device.
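
A gather kernel of this kind might look like the sketch below. It is not the answerer's code. It assumes the layout described in the question: a row-major float matrix, a fixed number of segments per row, all of the same length, with the starting column of every segment stored in a device array. The names gather_segments, gather_segments_host, seg_start, segs_per_row and seg_len are hypothetical. One thread handles one output element, so neighbouring threads write neighbouring output locations and read neighbouring input locations within a segment.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Packs the selected segments of each row of src into a dense row-major
// matrix dst of size rows x (segs_per_row * seg_len).
// seg_start[row * segs_per_row + s] is the first column of segment s in that row.
__global__ void gather_segments(const float* __restrict__ src, int src_ld,
                                const int* __restrict__ seg_start,
                                int rows, int segs_per_row, int seg_len,
                                float* __restrict__ dst)
{
    const size_t out_cols = (size_t)segs_per_row * seg_len;
    const size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= (size_t)rows * out_cols) return;

    const size_t row = idx / out_cols;        // output (and input) row
    const size_t col = idx % out_cols;        // column in the packed matrix
    const size_t seg = col / seg_len;         // which segment of this row
    const size_t off = col % seg_len;         // offset inside the segment
    const size_t src_col = (size_t)seg_start[row * segs_per_row + seg] + off;

    dst[idx] = src[row * src_ld + src_col];
}

// Demo wrapper that starts from host data so the kernel can be tried in
// isolation. In the real use case src and seg_start already live on the
// device and dst would be passed straight to GEMM. Error checks on the
// CUDA calls are omitted to keep the sketch short.
std::vector<float> gather_segments_host(const std::vector<float>& h_src,
                                        int rows, int src_ld,
                                        const std::vector<int>& h_seg_start,
                                        int segs_per_row, int seg_len)
{
    assert(rows > 0 && segs_per_row > 0 && seg_len > 0);
    assert(h_src.size() == (size_t)rows * src_ld);
    assert(h_seg_start.size() == (size_t)rows * segs_per_row);

    const size_t n_out = (size_t)rows * segs_per_row * seg_len;
    float* d_src = nullptr;
    float* d_dst = nullptr;
    int* d_seg = nullptr;
    cudaMalloc(&d_src, h_src.size() * sizeof(float));
    cudaMalloc(&d_dst, n_out * sizeof(float));
    cudaMalloc(&d_seg, h_seg_start.size() * sizeof(int));
    cudaMemcpy(d_src, h_src.data(), h_src.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_seg, h_seg_start.data(), h_seg_start.size() * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (int)((n_out + threads - 1) / threads);
    gather_segments<<<blocks, threads>>>(d_src, src_ld, d_seg,
                                         rows, segs_per_row, seg_len, d_dst);

    std::vector<float> h_dst(n_out);
    cudaMemcpy(h_dst.data(), d_dst, n_out * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaFree(d_seg);
    return h_dst;
}
```

For the 3 x 10 pattern drawn in the question, with two segments of length 2 per row, seg_start would hold {0, 8, 2, 6, 0, 4} and the packed result would be a dense 3 x 4 matrix. A matching gather on the rows of B would be needed before the two packed matrices are multiplied.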
