I'm trying to sum an array with this code and I am stuck. I probably need a "CUDA for dummies" tutorial, because I have spent so much time on such a basic operation and I can't make it work.
Here is a list of things I don't understand or am unsure of:
1. What number of blocks (dimGrid) should I use? I think it should be N/dimBlock.x/2 (N = length of the input array), because at the beginning of the kernel, data are loaded and added to shared memory from two "blocks" of global memory.
2. In the original code there was blockSize. I replaced it with blockDim.x because I don't know how these variables differ. But when blockSize = blockDim.x, then gridSize = blockDim.x*2*gridDim.x doesn't make sense to me - gridSize will be greater than N. What is the difference between *Dim.x and *Size in the context of a 1D array?
3. Main logic - in the kernel, each block sums 2*dimBlock (threads in block) numbers. When N = 262144 and dimBlock = 128, the kernel returns an array of 1024 partial sums. Then I run the kernel again, result = 4 partial sums. Finally, in the last run, a single sum is returned, because the array is processed by a single block.
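The pass sizes in the list above (1024, then 4, then 1) can be checked with a little host-side arithmetic. This is only a sketch: `blocks_for`, `grid_stride` and `passes_for` are hypothetical helper names, not part of the original code, and it assumes every block consumes 2 * threads elements, as the kernel's first load does. It also shows that with N/dimBlock.x/2 blocks the loop stride (the kernel's gridSize) comes out equal to N, so each thread runs the while loop once, whereas with N/dimBlock.x blocks the stride is 2N.

```cuda
// Host-only helpers; no GPU is needed to run them.

// Number of blocks needed when every block consumes 2 * threads elements
// (each thread loads and adds two values before the in-block reduction).
// Rounds up so that a tail shorter than 2 * threads still gets a block.
unsigned int blocks_for(unsigned int n, unsigned int threads) {
    unsigned int per_block = 2 * threads;
    return (n + per_block - 1) / per_block;
}

// Stride of the kernel's while loop: how many elements the whole grid
// covers in one iteration (called gridSize in the kernel).
unsigned int grid_stride(unsigned int threads, unsigned int blocks) {
    return threads * 2 * blocks;
}

// Number of kernel launches needed to reduce n elements to a single value,
// given that every block writes exactly one partial sum per launch.
unsigned int passes_for(unsigned int n, unsigned int threads) {
    unsigned int passes = 0;
    while (n > 1) {
        n = blocks_for(n, threads);
        ++passes;
    }
    return passes;
}
```

For N = 262144 and 128 threads per block, `blocks_for` gives 1024, then 4, then 1, which matches the three runs described in the question.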
I sum a binary array. In the first run, I can use uchar4 for input data. In the second and third runs, I will use int.
Please tell me what I am missing.
Thanks
__global__ void sum_reduction(uchar4* g_idata, int* g_odata, int N) {
    extern __shared__ int s_data[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + tid;
    unsigned int gridSize = blockDim.x*2*gridDim.x;
    while (i < N) {
        s_data[tid] += g_idata[i].x + g_idata[i+blockDim.x].x +
                       g_idata[i].y + g_idata[i+blockDim.x].y +
                       g_idata[i].z + g_idata[i+blockDim.x].z +
                       g_idata[i].w + g_idata[i+blockDim.x].w;
        i += gridSize;
    }
    __syncthreads();
    if (tid < 64) {
        s_data[tid] += s_data[tid + 64];
    }
    __syncthreads();
    if (tid < 32) {
        volatile int *s_ptr = s_data;
        s_ptr[tid] += s_ptr[tid + 32];
        s_ptr[tid] += s_ptr[tid + 16];
        s_ptr[tid] += s_ptr[tid + 8];
        s_ptr[tid] += s_ptr[tid + 4];
        s_ptr[tid] += s_ptr[tid + 2];
        s_ptr[tid] += s_ptr[tid + 1];
    }
    if (tid == 0) {
        g_odata[blockIdx.x] = s_data[0];
    }
}

main{
    ...
    dim3 dimBlock(128);
    dim3 dimGrid(N/dimBlock.x);
    sum_reduction<<<dimGrid, dimBlock>>>(in, out, N);
    ...
}
int smemSize = 128 * sizeof(int);
and called the kernel with the smemSize parameter. The result is correct. Thanks. PS: I didn't have runtime errors before. Initialisation of memory didn't affect the result (now it does). – eel
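The fix mentioned in the comment above (passing the dynamic shared-memory size as the third launch argument) can be sketched as follows. This is a minimal sketch under stated assumptions, not the asker's final code: `sum_reduction_sketch` and the host helper `sum_bytes` are hypothetical names, the input is assumed to be non-empty, the block size is assumed to be a power of two, and error checking is omitted. Apart from the launch argument it changes three things: the accumulator starts from zero instead of adding into uninitialised shared memory, the second load is guarded against reading past the end, and the unrolled `volatile` tail is swapped for a plain loop with `__syncthreads()` at every step, which is slower but does not rely on warp-synchronous execution.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block writes one partial sum of the bytes packed in g_idata.
// n is the number of uchar4 elements, not the number of bytes.
__global__ void sum_reduction_sketch(const uchar4* g_idata, int* g_odata, unsigned int n) {
    extern __shared__ int s_data[];   // sized by the third launch argument
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;

    int sum = 0;                      // register accumulator, starts from zero
    while (i < n) {
        uchar4 a = g_idata[i];
        sum += a.x + a.y + a.z + a.w;
        if (i + blockDim.x < n) {     // guard the second load
            uchar4 b = g_idata[i + blockDim.x];
            sum += b.x + b.y + b.z + b.w;
        }
        i += gridSize;
    }
    s_data[tid] = sum;                // every slot is written before it is read
    __syncthreads();

    // Tree reduction with a barrier per step; assumes blockDim.x is a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            s_data[tid] += s_data[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) {
        g_odata[blockIdx.x] = s_data[0];
    }
}

// Hypothetical host helper: runs one pass on the GPU and adds the per-block
// partial sums on the CPU. Assumes h_in is not empty.
int sum_bytes(const std::vector<uchar4>& h_in) {
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int threads = 128;
    const unsigned int blocks = (n + 2 * threads - 1) / (2 * threads);
    const size_t smemSize = threads * sizeof(int);   // one int per thread

    uchar4* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(uchar4));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(uchar4), cudaMemcpyHostToDevice);

    // Without smemSize here, the extern __shared__ array has no storage.
    sum_reduction_sketch<<<blocks, threads, smemSize>>>(d_in, d_out, n);

    std::vector<int> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int total = 0;
    for (int p : h_out) {
        total += p;
    }
    return total;
}
```

The helper finishes the sum on the CPU only to keep the sketch short; the multi-pass scheme from the question would instead launch an int-input variant of the kernel on the partial sums until one value remains.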