1 vote

I wrote a CUDA kernel that performs an array addition of two arrays arr1 and arr2. The information about which index of arr1 should be added to which index of arr2 is stored in an array idx.

Here is a code example:

__global__ void add(float* arr1, float* arr2, int* idx, int length)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread performs (length) additions,
    // arr2 is (length) times larger than arr1
    for (int j = threadIdx.x; j < length*blockDim.x; j += blockDim.x)
    {
        arr1[i] += arr2[ idx[blockIdx.x*blockDim.x + j] ]; // edited here
    }
}

The code yields the correct output, but it is hardly faster than an OpenMP-parallel version on the CPU with 8 threads. I tried this for different block sizes.

I suspect that the access pattern to arr2 is inefficient, since arr2 resides in global memory and is accessed quasi-randomly: the array idx contains unique, sorted, but non-contiguous indices (could be 2, 3, 57, 103, ...). Therefore, the L1 cache is not taken advantage of. Also, the arrays are very large and cannot fit entirely in shared memory.

Is there any way around this obstacle? Do you have any ideas on how to optimize the access pattern to arr2?

What makes you think access to arr2 is the bottleneck? What about the completely uncoalesced access to idx? – talonmies
Good point. Still, one thread at least accesses length contiguous elements of idx, whereas there is no order in the access pattern of arr2 whatsoever. – Julian
I think you need to read about memory coalescing. – talonmies
Yes, thanks, I did some research and updated the question. The access to idx should now be coalesced and my code runs twice as fast. Now I'm pretty sure that the access to arr2 is the bottleneck. – Julian
There is no way that the edited code could produce the same results as the original in your question. – talonmies

1 Answer

0 votes

A couple of things you can try here:

  • Global memory is slow; try staging the data in shared memory per block, i.e. each block loads only the partial slice it works on (see the first sketch after this list).
  • If there is anything that is effectively constant, move it to constant memory.
  • Try unrolling the loop.
  • Definitely try launching multiple kernels with different stream parameters as well as different block/thread combinations (a rough multi-stream launch is sketched below as well).
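
For the first point, here is a minimal sketch, not a drop-in replacement for your kernel: each block stages its slice of idx in shared memory with coalesced loads before gathering from arr2. The layout assumption is that idx holds blockDim.x * length contiguous entries per block; the kernel name add_staged and the launch parameters are made up for illustration. The gather from arr2 itself stays scattered, and the staged slice has to fit within the shared-memory limit, so whether this pays off is something you need to measure.

// Sketch only (not a drop-in replacement): assumes idx stores one
// contiguous slice of blockDim.x * length indices per block; adapt the
// offsets to however idx is actually laid out.
__global__ void add_staged(float* arr1, const float* arr2,
                           const int* idx, int length)
{
    extern __shared__ int s_idx[];   // blockDim.x * length ints per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int blockOffset = blockIdx.x * blockDim.x * length;   // assumed layout

    // Stage this block's slice of idx with coalesced global loads.
    for (int j = threadIdx.x; j < length * blockDim.x; j += blockDim.x)
        s_idx[j] = idx[blockOffset + j];
    __syncthreads();

    // Gather from arr2. The gather itself remains scattered; only the
    // index lookups and the accumulator are kept off global memory.
    float sum = arr1[i];
    for (int j = threadIdx.x; j < length * blockDim.x; j += blockDim.x)
        sum += arr2[s_idx[j]];
    arr1[i] = sum;
}

// Example launch -- the third launch parameter is the dynamic shared-memory
// size. If blockDim.x * length * sizeof(int) exceeds the shared-memory
// limit, stage the slice in smaller tiles instead.
//
// add_staged<<<numBlocks, threadsPerBlock,
//              threadsPerBlock * length * sizeof(int)>>>(arr1, arr2, idx, length);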

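And a rough sketch of the last point, splitting the work over several streams. NCHUNKS, arr1Size, threadsPerBlock and the per-chunk idx offset are assumptions for illustration, not taken from your code; on its own this mainly helps when individual launches are too small to fill the GPU or when you also overlap asynchronous copies, so again: measure.

// Sketch only: arr1, arr2, idx are device pointers; arr1Size is assumed
// to be divisible by NCHUNKS * threadsPerBlock, and idx is assumed to
// hold length entries per arr1 element, contiguous per chunk.
void launch_in_streams(float* arr1, float* arr2, int* idx,
                       int length, int arr1Size, int threadsPerBlock)
{
    const int NCHUNKS = 4;
    cudaStream_t streams[NCHUNKS];
    for (int s = 0; s < NCHUNKS; ++s)
        cudaStreamCreate(&streams[s]);

    int chunkElems  = arr1Size / NCHUNKS;            // arr1 elements per chunk
    int chunkBlocks = chunkElems / threadsPerBlock;  // blocks per chunk

    for (int s = 0; s < NCHUNKS; ++s)
    {
        // Each chunk processes its own part of arr1 and the matching
        // slice of idx in its own stream.
        add<<<chunkBlocks, threadsPerBlock, 0, streams[s]>>>(
            arr1 + s * chunkElems,
            arr2,
            idx + s * chunkElems * length,
            length);
    }

    for (int s = 0; s < NCHUNKS; ++s)
    {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
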
Hope this gives you an idea on your way to optimization :)