I have a basic question about coalesced memory access in CUDA.
For example, I have an array of 32 elements and 32 threads, and each thread accesses one element.
__global__ void co_acc ( int A[32], int B[32] ) {
    // Global thread index: this block's offset plus the thread's position in the block.
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);
    B[inx] = A[inx];
}
Now, what I want to know: if I have the same 32 threads but an array of 64 elements, each thread has to copy 2 elements. To keep the access coalesced, I should offset the array index by the number of threads I have.
E.g., the thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?
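__global__ void co_acc ( int A[64], int B[64] ) {
    // Global thread index (a single block of 32 threads is assumed here).
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);
    // Number of elements each thread copies: 64 / 32 = 2.
    int actions = 64/blockDim.x;
    for ( int i = 0; i < actions; ++i )
        // Step by blockDim.x so that adjacent threads touch adjacent elements.
        B[inx+(i*blockDim.x)] = A[inx+(i*blockDim.x)];
}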
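To check the idea end to end, here is a minimal sketch of the same strided copy together with host code that launches it and compares the result. It assumes a single block of 32 threads and an element count that is a multiple of the block size. The helper name `copy_and_count_mismatches` and the test values are made up for illustration; they are not part of the kernel above.

```cuda
#include <cuda_runtime.h>

// Same access pattern as in the question: in iteration i, thread t touches
// element t + i * blockDim.x, so the 32 threads of the warp read and write
// 32 consecutive ints per iteration.
// Assumption: launched with exactly one block, and n % blockDim.x == 0.
__global__ void co_acc(const int *A, int *B, int n)
{
    int inx = threadIdx.x;
    int actions = n / blockDim.x;
    for (int i = 0; i < actions; ++i) {
        int idx = inx + i * blockDim.x;
        B[idx] = A[idx];
    }
}

// Hypothetical helper: copies 64 ints with 32 threads and returns the number
// of mismatching elements, or -1 if any CUDA call fails.
int copy_and_count_mismatches()
{
    const int n = 64;
    const int threads = 32;
    int hA[n], hB[n];
    for (int i = 0; i < n; ++i) {
        hA[i] = 3 * i + 1;   // arbitrary test data
        hB[i] = -1;          // sentinel so an unwritten element is detected
    }

    int *dA = nullptr, *dB = nullptr;
    if (cudaMalloc(&dA, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&dB, n * sizeof(int)) != cudaSuccess) { cudaFree(dA); return -1; }

    cudaError_t err = cudaMemcpy(dA, hA, n * sizeof(int), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        co_acc<<<1, threads>>>(dA, dB, n);
        err = cudaGetLastError();   // catches launch configuration errors
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(hB, dB, n * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(dA);
    cudaFree(dB);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (hA[i] != hB[i]) ++mismatches;
    }
    return mismatches;
}
```

Calling `copy_and_count_mismatches()` from `main` and checking that it returns 0 confirms that every element was copied. It does not by itself prove the access is coalesced; that would need a profiler.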
A and the write to B. Is that the extent of your question? (There are other possible ways to access all 64 elements in a coalesced fashion, and running a threadblock of 32 threads = 1 warp may not be the most efficient use of the GPU.) - Robert Crovella
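As one illustration of such an alternative (an assumption about what could be meant, not something stated in the comment), a grid-stride loop keeps the same coalesced pattern while allowing any number of blocks and threads. The names `copy_grid_stride` and `grid_stride_mismatches` are hypothetical and exist only for this sketch.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride copy: each thread starts at its global index and advances by
// the total number of threads in the grid. Within a warp, consecutive threads
// still touch consecutive elements in every iteration.
__global__ void copy_grid_stride(const int *A, int *B, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        B[i] = A[i];
    }
}

// Hypothetical helper: runs the copy for n elements with the given launch
// configuration and returns the number of mismatches, or -1 on a CUDA error.
int grid_stride_mismatches(int n, int blocks, int threads)
{
    std::vector<int> hA(n), hB(n, -1);
    for (int i = 0; i < n; ++i) hA[i] = 7 * i + 2;   // arbitrary test data

    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *dA = nullptr, *dB = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dB, bytes) != cudaSuccess) { cudaFree(dA); return -1; }

    cudaError_t err = cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        copy_grid_stride<<<blocks, threads>>>(dA, dB, n);
        err = cudaGetLastError();   // catches launch configuration errors
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(hB.data(), dB, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dA);
    cudaFree(dB);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (hA[i] != hB[i]) ++mismatches;
    }
    return mismatches;
}
```

For example, `grid_stride_mismatches(64, 2, 32)` copies the 64 elements with two blocks of 32 threads, one element per thread, and `grid_stride_mismatches(64, 1, 32)` reproduces the two-elements-per-thread case from the question. Which configuration is faster depends on the GPU and the problem size.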