I need to compute A[x][y] = sum over z = 0 .. n-1 of (B[x][y][z] + C[x][y][z]), where matrix A has dimensions [height][width] and matrices B and C have dimensions [height][width][n].
The values are laid out in linear memory with something like:
    index = 0;
    for (z = 0; z < n; ++z)
        for (y = 0; y < width; ++y)
            for (x = 0; x < height; ++x) {
                matrix[index] = value;
                index++;
            }
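To make the layout explicit, here is a small sketch of the index formula that I believe the loop above implies (x varies fastest, then y, then z). The name flatIndex is something I made up for illustration, not part of my real code.

```cuda
#include <cstddef>

// Hypothetical helper: flat offset of element (x, y, z) for the fill loop
// above, where x (0..height-1) is the innermost index, then y, then z.
__host__ __device__ inline std::size_t flatIndex(std::size_t x, std::size_t y,
                                                 std::size_t z,
                                                 std::size_t height,
                                                 std::size_t width)
{
    return x + y * height + z * height * width;
}
```

With this layout, A[x][y] would sit at offset x + y*height, and consecutive z values of B and C for the same (x, y) are height*width elements apart.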
I would like each block to calculate one sum, since each block has its own shared memory. To avoid data races I use atomicAdd, something like this:
Host code (launch configuration):
    dim3 block(n, 1, 1);          // one thread per z value
    dim3 grid(height, width, 1);  // one block per element of A
Kernel:
    // gridDim.x == height, gridDim.y == width, threadIdx.x == z
    atomicAdd(&A[blockIdx.x + blockIdx.y * gridDim.x],
              B[blockIdx.x + blockIdx.y * gridDim.x + threadIdx.x * gridDim.x * gridDim.y]
            + C[blockIdx.x + blockIdx.y * gridDim.x + threadIdx.x * gridDim.x * gridDim.y]);
I would like to use shared memory to compute the sum and then copy the result to global memory.
I am not sure how to do the shared-memory part. Each block's shared memory will hold just one number (the sum). How should I copy this number to the right place in matrix A in global memory?
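For reference, this is a rough sketch of the kind of kernel I have in mind, not a tested solution. It assumes float data, a fixed power-of-two block size (kThreads, my own choice) instead of n threads per block, and the x-fastest layout described above. The names sumOverZ and sumOverZHost are hypothetical. Each thread adds up a strided part of the z range, the block reduces the partial sums in shared memory, and thread 0 writes the result to the block's slot in A.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // assumed block size; must be a power of two

// One block per (x, y) element of A. gridDim.x == height, gridDim.y == width.
__global__ void sumOverZ(float *A, const float *B, const float *C, int n)
{
    __shared__ float partial[kThreads];  // one partial sum per thread

    const int plane = gridDim.x * gridDim.y;                // elements per z-slice
    const int xy    = blockIdx.x + blockIdx.y * gridDim.x;  // x varies fastest

    // Each thread accumulates a strided subset of the z range.
    float acc = 0.0f;
    for (int z = threadIdx.x; z < n; z += blockDim.x)
        acc += B[xy + z * plane] + C[xy + z * plane];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory; halves the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 copies the block's single result to its place in A.
    if (threadIdx.x == 0)
        A[xy] = partial[0];
}

// Hypothetical host wrapper: copies B and C to the device, launches the
// kernel, and returns A. Error checking is omitted for brevity.
std::vector<float> sumOverZHost(const std::vector<float> &B,
                                const std::vector<float> &C,
                                int height, int width, int n)
{
    const size_t planeBytes = sizeof(float) * height * width;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, planeBytes);
    cudaMalloc(&dB, planeBytes * n);
    cudaMalloc(&dC, planeBytes * n);
    cudaMemcpy(dB, B.data(), planeBytes * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), planeBytes * n, cudaMemcpyHostToDevice);

    dim3 block(kThreads, 1, 1);
    dim3 grid(height, width, 1);
    sumOverZ<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> A(static_cast<size_t>(height) * width);
    cudaMemcpy(A.data(), dA, planeBytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return A;
}
```

Since only one thread per block writes to A, and each block owns a different element, this version would not need atomicAdd at all. Is this the right way to do it?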