It's not 100% clear what you are trying to do. If you are trying to copy data from global to shared memory then presumably it has some structure, e.g. an array of `char`s or `float`s or something. The following answer will assume that you are dealing with an array of `char`s (you can replace `char` with any data type).
Summary: don't think about explicitly accessing 32/64/128 bytes at a time, just write code such that the memory accesses can be coalesced.
You can access data however you want with CUDA, just like in normal C/C++/whatever. You can even get down to individual bytes. What the programming guide is saying is that whenever data is accessed, a block of 32/64/128 bytes has to be read. E.g. if you have `char a[128]` and want to get `a[17]`, then the GPU will have to read from `a[0]` to `a[31]` to be able to get the data in `a[17]`. This happens transparently; you don't need to code any differently to be able to access individual bytes.
The major consideration is memory access speed: if you have to read 31 junk bytes for every useful byte, you reduce your effective memory bandwidth by a factor of 32 (and you also end up doing many more global memory accesses, which are slow)!
However, memory access on GPUs can be "coalesced" across threads in a block (this question gives a reasonable starting point for optimising for coalescing). Briefly, coalescing means that memory accesses which happen simultaneously for several threads in a block can be "batched" together so that only one read needs to happen.
The point is that coalescing happens across threads within a block (not within a single thread), so for a copy-into-shared-memory one could do the following (`array` is an array of `char`s in global memory):
__shared__ char shrd[SIZE];
// each thread copies one byte; adjacent threads touch adjacent bytes, so the read is coalesced
shrd[threadIdx.x] = array[blockDim.x * blockIdx.x + threadIdx.x];
__syncthreads();
This will get each thread to copy one byte into the shared array. This memcpy operation essentially happens in parallel, and the data accesses are coalesced so there is no wasted bandwidth (or time).
The above strategy is much much better than getting a single thread to iterate through and copy byte-by-byte.
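For context, here is a minimal sketch of how that snippet might sit inside a complete kernel and be launched. The kernel name, the value of SIZE, and the launch configuration are illustrative assumptions, not part of the original code:

#define SIZE 256  // bytes staged per block; assumes blockDim.x == SIZE

// Hypothetical kernel: each thread stages one byte into shared memory,
// then the whole block can work on the shared copy.
__global__ void stageToShared(const char *array)
{
    __shared__ char shrd[SIZE];
    shrd[threadIdx.x] = array[blockDim.x * blockIdx.x + threadIdx.x];
    __syncthreads();  // every byte is in shared memory from here on
    // ... operate on shrd ...
}

// Launched with one thread per byte, e.g. from host code:
// stageToShared<<<numBlocks, SIZE>>>(d_array);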
One could also treat each block of n bytes of the array as a single n-byte data type and get each thread to copy that, e.g. for `n == 16`, by casting to `uint4`:
__shared__ char shrd[SIZE];
// each thread moves 16 bytes with a single uint4 load/store
((uint4*)shrd)[threadIdx.x] = ((uint4*)array)[blockDim.x * blockIdx.x + threadIdx.x];
__syncthreads();
This would allow each thread to copy 16 bytes at once. Notes on that bit of code:
- I haven't tested or benchmarked it
- I don't know if it is good practice (I would strongly expect that it's not).
- The indices are scaled by 16 (e.g. `threadIdx.x == 1` corresponds to writing to `shrd[16]`, `shrd[17]`, ..., `shrd[31]`)
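One more caveat worth flagging (my addition, not in the original notes): `uint4` accesses must be 16-byte aligned. Pointers from cudaMalloc are aligned to at least 256 bytes so the global side is fine, but a shared `char` array should carry an explicit `__align__(16)` to be safe. A minimal sketch of the vectorized version with that applied, assuming SIZE is a multiple of 16 (the kernel name is hypothetical):

#define SIZE 256  // bytes staged per block; must be a multiple of 16

// Hypothetical kernel: each thread moves one 16-byte uint4,
// so the block needs SIZE / 16 threads.
__global__ void stageToShared16(const char *array)
{
    __shared__ __align__(16) char shrd[SIZE];  // force 16-byte alignment
    ((uint4 *)shrd)[threadIdx.x] =
        ((const uint4 *)array)[blockDim.x * blockIdx.x + threadIdx.x];
    __syncthreads();
    // ... operate on shrd ...
}

// Launched with SIZE / 16 threads per block, e.g.:
// stageToShared16<<<numBlocks, SIZE / 16>>>(d_array);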
As a side note: depending on your specific use case, the built-in cudaMemcpy functions might be useful.
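For example, a plain device-to-device copy between two global buffers from host code would look something like this (buffer names and size are placeholders):

// d_src and d_dst are device pointers from cudaMalloc; numBytes is the copy size
cudaMemcpy(d_dst, d_src, numBytes, cudaMemcpyDeviceToDevice);

Note that this copies between global buffers; it cannot populate shared memory, which only code running inside a kernel can do.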