2
votes

CUDA C Programming Guide (p.70) says,

Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e. whose first address is a multiple of their size) can be read or written by memory transactions.

So, if I want to access 32, 64, or 128 contiguous bytes at one time in a device function (for example, to copy into shared memory), what is the most suitable function (or assignment) for this operation?

The traditional C memcpy function does not seem to access 32 bytes at one time (it is very slow). And because this is not vector data, I want a single thread to read this data in one go.


To dbaupp

memcpy works, but I'm talking about speed. For example, assume I have a device memory pointer p and run the following code in a device function.

a) char c[8]; memcpy(c, p, 8);

b) char c[8]; *(double *)c = *(double *)p;

For the above two cases, the result is the same, but case (b) is nearly 8 times faster than case (a) (I tested and confirmed this in my code).

And FYI, the cudaMemcpy function does not work in device functions.

So, what I want to know is whether there is any way to copy 16 bytes in a single operation (hopefully 16 times faster than memcpy(c, p, 16)).

1
If you want to load 16 bytes, use one of the CUDA vector types, like uint4. – talonmies

@user727062, you should've commented on my answer so that I got a notification. And did you read what I said about coalesced memory access across threads? That is precisely why memcpy is so slow and why you shouldn't use it in device code. (And I even gave you an example of copying 16 bytes at a time.) – huon

1 Answer

6
votes

It's not 100% clear what you are trying to do. If you are trying to copy data from global to shared memory then presumably it has some structure, e.g. an array of chars or floats or something. The following answer will assume that you are dealing with an array of chars (you can replace char with any data type).

Summary: don't think about explicitly accessing 32/64/128 bytes at a time, just write code such that the memory accesses can be coalesced.


You can access data however you want with CUDA, just like in normal C/C++/whatever. You can even get down to individual bytes. What the programming guide is saying is that whenever data is accessed, a block of 32/64/128 bytes has to be read. E.g. if you have char a[128] and want to get a[17], then the GPU will have to read from a[0] to a[31] to be able to get the data in a[17]. This happens transparently; you don't need to code any differently to be able to access individual bytes.

The major consideration is memory access speed: if one has to read 31 junk bytes for each useful byte, then you are reducing your effective memory bandwidth by a factor of 32 (and it also means that you have to do many more global memory accesses, which are sloowww)!

However, memory access on GPUs can be "coalesced" across threads in a block (this question gives a reasonable starting point for optimising for coalescing). Briefly, coalescing allows memory accesses that happen simultaneously for several threads in a block to be "batched" together so that only one read needs to happen.

The point of this is the coalescing happens across threads within a block (not within a single thread), so for a copy-into-shared-memory one could do (array is an array of chars in global memory):

__shared__ char shrd[SIZE];

shrd[threadIdx.x] = array[blockDim.x * blockIdx.x  + threadIdx.x];
__syncthreads();

This will get each thread to copy one byte into the shared array. This memcpy operation essentially happens in parallel, and the data accesses are coalesced so there is no wasted bandwidth (or time).

The above strategy is much much better than getting a single thread to iterate through and copy byte-by-byte.

One could also treat each block of n bytes of the array as a single n-byte data type and get each thread to copy that. E.g. for n == 16, do some casts to uint4:

__shared__ char shrd[SIZE];

((uint4*)shrd)[threadIdx.x] = ((uint4*)array)[blockDim.x * blockIdx.x + threadIdx.x];
__syncthreads();

This would allow each thread to copy 16 bytes at once. Notes on that bit of code:

  • I haven't tested or benchmarked it
  • I don't know if it is good practice (I would strongly expect that it's not)
  • The indices are scaled by 16 (e.g. threadIdx.x == 1 corresponds to writing to shrd[16], shrd[17], ..., shrd[31])

As a side note: depending on your specific use-case, the built-in cudaMemcpy functions might be useful.