0 votes

I have a basic question on coalesced CUDA memory access.

For example, say I have an array of 32 elements and 32 threads, where each thread accesses one element.

__global__ void co_acc ( int A[32], int B[32] ) {
    int inx = threadIdx.x + (blockIdx.x * blockDim.x); // global thread index
    B[inx] = A[inx];
}

Now, what I want to know is: if I have 32 threads but an array of 64 elements, each thread has to copy 2 elements. To keep the access coalesced, I should shift the index for the array access by the number of threads I have, e.g. thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?

__global__ void co_acc ( int A[64], int B[64] ) {
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);
    int actions = 64 / blockDim.x;   // elements handled per thread
    for ( int i = 0; i < actions; ++i )
        B[inx + (i * blockDim.x)] = A[inx + (i * blockDim.x)];
}
Yes, both iterations of the for-loop in your second example should result in coalesced access for both the read from A and the write to B. Is that the extent of your question? (There are other possible ways to access all 64 elements in a coalesced fashion, and running a threadblock of 32 threads = 1 warp may not be the most efficient use of the GPU.) - Robert Crovella
Yes, this was my question, thank you very much. I had some problems with my code and wanted to make sure that I got the concept right, and that my code is the problem and not my understanding of coalesced memory access. The numbers used in this question are only examples to clarify my question. - lamchob

1 Answer

2 votes

To keep the access coalesced, I should shift the index for the array access by the number of threads I have, e.g. thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?

Yes, that's a correct approach.

Strictly speaking it's not should but rather could: any memory access will be coalesced as long as all threads within a warp request addresses that fall within the same (aligned) 128-byte line. This means you could even permute the thread indices and your accesses would still be coalesced (but why make things complicated when you can keep them simple).
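
To illustrate that point, here is a minimal sketch of a permuted but still coalesced access. The XOR-based lane swap and the kernel name co_acc_permuted are my own illustration, not from the question:

__global__ void co_acc_permuted ( int A[32], int B[32] ) {
    // Even and odd lanes swap targets: thread 0 touches element 1,
    // thread 1 touches element 0, and so on. All 32 addresses requested
    // by the warp still fall within the same aligned 128-byte line,
    // so the access is still fully coalesced.
    int inx = (threadIdx.x + blockIdx.x * blockDim.x) ^ 1;
    B[inx] = A[inx];
}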

Another solution would be to have each thread load an int2:

__global__ void co_acc ( int A[64], int B[64] ) {
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);

    // Each thread moves 8 bytes (two ints) in a single vectorized access.
    reinterpret_cast<int2*>(B)[inx] = reinterpret_cast<int2*>(A)[inx];
}

This is (in my opinion) simpler and clearer code, and might give marginally better performance, as it may reduce the number of instructions emitted by the compiler and the latency between memory requests (disclaimer: I have not tried it).
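
One caveat worth noting (my addition, not from the original answer): reinterpret_cast<int2*> is only valid if the pointer is aligned to sizeof(int2) = 8 bytes. Pointers returned by cudaMalloc are aligned to at least 256 bytes, so casting the base of an allocation as above is safe; an arbitrary offset into the array might not be. A minimal host-side launch sketch for the int2 version (error handling omitted; dA and dB are illustrative names):

int *dA, *dB;
cudaMalloc(&dA, 64 * sizeof(int));  // cudaMalloc guarantees >= 256-byte alignment,
cudaMalloc(&dB, 64 * sizeof(int));  // so the int2 casts in the kernel are valid
co_acc<<<1, 32>>>(dA, dB);          // 32 threads, each moving one int2 (two ints)
cudaDeviceSynchronize();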

Note: as Robert Crovella has mentioned in his comment, if you really are using thread blocks of 32 threads, then you are likely seriously underusing the capacity of your GPU.
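
To expand on that last point, a common remedy is a grid-stride loop, which works for any array size, keeps every access coalesced, and lets you launch enough blocks to keep the GPU busy. A sketch under those assumptions (the kernel name, n, and the launch configuration are illustrative; note that gridDim.x * blockDim.x, the expression that appeared in the question's index calculation, is the total thread count and serves here as the stride, not as part of the thread index):

__global__ void copy_grid_stride ( const int *A, int *B, int n ) {
    // Each thread starts at its global index and advances by the total
    // number of threads in the grid; consecutive threads always touch
    // consecutive elements, so every iteration is coalesced.
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x)
        B[i] = A[i];
}

// Example launch: 256 threads per block, enough blocks to cover n.
// copy_grid_stride<<<(n + 255) / 256, 256>>>(dA, dB, n);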