2
votes

I'm trying to sum an array with this code and I am stuck. I probably need some "CUDA for dummies" tutorial, because I have spent so much time on such a basic operation and I can't make it work.

Here is a list of things I don't understand or am unsure of:

  1. What number of blocks (dimGrid) should I use? I think it should be N/dimBlock.x/2 (N = length of the input array), because at the beginning of the kernel, data is loaded from two "blocks" of global memory and added into shared memory.

  2. In the original code there was blockSize. I replaced it with blockDim.x because I don't know how these variables differ. But when blockSize = blockDim.x, then gridSize = blockDim.x*2*gridDim.x doesn't make sense to me - gridSize will be greater than N. What is the difference between *Dim.x and *Size in the context of a 1D array?

  3. Main logic - in the kernel, each block sums 2*dimBlock (threads per block) numbers. When N = 262144 and dimBlock = 128, the kernel returns an array of 1024 partial sums. Then I run the kernel again and get 4 partial sums. Finally, in the last run, a single sum is returned, because the array is processed by a single block (see the sketch after this list).

  4. I am summing a binary array. In the first run, I can use uchar4 for the input data. In the second and third runs, I will use int.
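To make point 3 concrete, this is the launch arithmetic I have in mind (just a sketch - the grid sizes are what I computed by hand, not tested code):

dim3 dimBlock(128);

// Pass 1: N = 262144 uchar4 elements, each block loads 2*128 of them
dim3 dimGrid1(262144 / (dimBlock.x * 2));   // = 1024 blocks -> 1024 partial sums

// Pass 2: 1024 ints, same scheme
dim3 dimGrid2(1024 / (dimBlock.x * 2));     // = 4 blocks -> 4 partial sums

// Pass 3: 4 ints processed by a single block -> final sum
dim3 dimGrid3(1);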

Please tell me what I am missing.

Thanks

__global__ void sum_reduction(uchar4* g_idata, int* g_odata, int N) {
    extern __shared__ int s_data[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;

    // grid-stride loop: each thread adds the four channels of two
    // uchar4 elements per iteration
    while (i < N) {
        s_data[tid] += g_idata[i].x + g_idata[i + blockDim.x].x +
                       g_idata[i].y + g_idata[i + blockDim.x].y +
                       g_idata[i].z + g_idata[i + blockDim.x].z +
                       g_idata[i].w + g_idata[i + blockDim.x].w;
        i += gridSize;
    }
    __syncthreads();

    // tree reduction in shared memory (assumes 128 threads per block)
    if (tid < 64) {
        s_data[tid] += s_data[tid + 64];
    }
    __syncthreads();

    // final warp: no __syncthreads() needed, volatile prevents caching
    if (tid < 32) {
        volatile int *s_ptr = s_data;
        s_ptr[tid] += s_ptr[tid + 32];
        s_ptr[tid] += s_ptr[tid + 16];
        s_ptr[tid] += s_ptr[tid + 8];
        s_ptr[tid] += s_ptr[tid + 4];
        s_ptr[tid] += s_ptr[tid + 2];
        s_ptr[tid] += s_ptr[tid + 1];
    }

    if (tid == 0) {
        g_odata[blockIdx.x] = s_data[0];
    }
}


int main() {
    ...
    dim3 dimBlock(128);
    dim3 dimGrid(N / dimBlock.x);
    sum_reduction<<<dimGrid, dimBlock>>>(in, out, N);
    ...
}
The kernel as written has a fairly major bug in it - the main summation loop is using uninitialised memory, so the results should be completely unpredictable. Also, the kernel launch is lacking a shared memory size, which should cause a kernel runtime error (or is that just a typo?). – talonmies
@talonmies You saved me again! I added int smemSize = 128 * sizeof(int); and called the kernel with the smemSize parameter. The result is correct. Thanks. PS: I didn't have runtime errors before. Initialisation of the memory didn't affect the result before (now it does). – eel
Please add the answer to the question, rather than as a comment. – harrism

2 Answers

4
votes

Calling the kernel like this fixes the problem.

dim3 dimBlock(128);
dim3 dimGrid(N/dimBlock.x);
int smemSize = dimBlock.x * sizeof(int);
sum_reduction<<<dimGrid, dimBlock, smemSize>>>(in, out, N);    
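
For completeness, here is one way the kernel itself can be repaired to address the uninitialised shared memory that the comments point out: accumulate into a register and write it to s_data before the tree reduction. This is a sketch based on the comments (the bounds guard on the second load is my addition), not necessarily the poster's exact fix:

__global__ void sum_reduction(uchar4* g_idata, int* g_odata, int N) {
    extern __shared__ int s_data[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;

    int sum = 0;                          // accumulate in a register, starting from zero
    while (i < N) {
        sum += g_idata[i].x + g_idata[i].y + g_idata[i].z + g_idata[i].w;
        if (i + blockDim.x < N) {         // guard the second load
            sum += g_idata[i + blockDim.x].x + g_idata[i + blockDim.x].y +
                   g_idata[i + blockDim.x].z + g_idata[i + blockDim.x].w;
        }
        i += gridSize;
    }
    s_data[tid] = sum;                    // every thread writes, so s_data is fully initialised
    __syncthreads();

    if (tid < 64) s_data[tid] += s_data[tid + 64];
    __syncthreads();

    if (tid < 32) {                       // final warp: volatile avoids stale shared-memory reads
        volatile int *s_ptr = s_data;
        s_ptr[tid] += s_ptr[tid + 32];
        s_ptr[tid] += s_ptr[tid + 16];
        s_ptr[tid] += s_ptr[tid + 8];
        s_ptr[tid] += s_ptr[tid + 4];
        s_ptr[tid] += s_ptr[tid + 2];
        s_ptr[tid] += s_ptr[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = s_data[0];
}

Note that with dimGrid = N/dimBlock.x, half of the blocks start with i >= N; once s_data is properly initialised they write 0 instead of garbage, so the next pass still sums to the correct total.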
-3
votes

Okay, I think you need to start fresh. Take a look at this step-by-step guide from NVIDIA on reduction.
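
To give a flavour of what that guide covers, its first (deliberately naive) kernel looks roughly like this - interleaved addressing, one element loaded per thread - and the guide then optimises it step by step:

__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    // each thread loads one element from global to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // tree reduction with interleaved addressing
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // thread 0 writes this block's partial sum
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}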