I wrote a CUDA kernel that performs an array addition of two arrays arr1 and arr2. The information specifying which index of arr1 should be added with which index of arr2 is stored in an array idx.
Here is a code example:
__global__ void add(float* arr1, float* arr2, int* idx, int length)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread performs (length) additions;
    // arr2 is (length) times larger than arr1
    for (int j = threadIdx.x; j < length * blockDim.x; j += blockDim.x)
    {
        // consecutive threads read consecutive entries of idx
        arr1[i] += arr2[ idx[blockIdx.x * blockDim.x + j] ]; // edited here
    }
}
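For reference, a hypothetical host-side launch matching this layout (the names and sizes below are assumptions, since the host code is not part of the question):

    int n = 1 << 20;              // number of elements in arr1 (assumed)
    int length = 16;              // additions per thread (assumed)
    int blockSize = 256;          // one of the block sizes tried
    int gridSize = n / blockSize; // assumes n is a multiple of blockSize

    // d_arr1, d_arr2, d_idx are hypothetical device pointers; per the
    // kernel comment, arr2 is length times larger than arr1, and idx is
    // assumed sized to cover every offset the kernel computes
    add<<<gridSize, blockSize>>>(d_arr1, d_arr2, d_idx, length);
    cudaDeviceSynchronize();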
The code yields the correct output, but it is hardly faster than an OpenMP-parallel version on the CPU with 8 threads. I have tried different block sizes.
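For comparison, the CPU baseline was presumably something along these lines (a sketch only; the actual OpenMP code is not shown, and the flat i * length + k layout of idx here is an assumption):

    #include <omp.h>

    // Hypothetical CPU baseline: n is the number of elements in arr1,
    // and idx holds the arr2 source index of each of the n * length additions
    void add_cpu(float* arr1, const float* arr2, const int* idx,
                 int n, int length)
    {
        #pragma omp parallel for num_threads(8)
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < length; ++k)
                arr1[i] += arr2[idx[i * length + k]];
    }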
I suspect that the access pattern to arr2 is inefficient, since arr2 resides in global memory and is accessed quasi-randomly: the array idx contains unique, sorted, but non-contiguous indices (e.g. 2, 3, 57, 103, ...). Therefore, the L1 cache is not taken advantage of. Also, the arrays are very large and cannot fit entirely in shared memory.
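For illustration, one direction sometimes suggested for scattered read-only gathers is to route them through the read-only data cache with __ldg (available on compute capability 3.5 and later). A minimal sketch of such a variant, not a measured fix:

    // Same kernel, but arr2 and idx are marked read-only and loaded via
    // __ldg, so the scattered reads go through the read-only (texture) cache
    __global__ void add_ldg(float* arr1, const float* __restrict__ arr2,
                            const int* __restrict__ idx, int length)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (int j = threadIdx.x; j < length * blockDim.x; j += blockDim.x)
        {
            arr1[i] += __ldg(&arr2[__ldg(&idx[blockIdx.x * blockDim.x + j])]);
        }
    }

Whether this helps depends on how much reuse there is between nearby indices; with truly random access it may change little.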
Is there any way around this obstacle? Do you have any ideas on how to optimize the access pattern to arr2?
Comments:

arr2 is the bottleneck? What about the completely uncoalesced access to idx? – talonmies

[...] length contiguous elements of idx, whereas there is no order in the access pattern of arr2 whatsoever. – Julian

idx should now be coalesced and my code runs twice as fast. Now I'm pretty sure that the access to arr2 is the bottleneck. – Julian
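To spell out the coalescing point from the comments, here is the index arithmetic behind the idx loads (an illustration, not part of the original code):

    // Offset into idx read by thread t of block b in loop iteration k.
    // For fixed b and k, consecutive t give consecutive offsets, e.g.
    // b = 0, k = 0, blockDim = 256 yields offsets 0..255: one fully
    // coalesced run. The arr2 gather that follows remains scattered.
    __host__ __device__ int idx_offset(int b, int t, int k, int blockDim)
    {
        int j = t + k * blockDim;   // value of the loop variable
        return b * blockDim + j;    // == (b + k) * blockDim + t
    }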