0
votes

I've launched a kernel with 2100 blocks and 4 threads per block.

At some point in this kernel, all the threads have to execute a function and store its result in an array (in global memory) at position "threadIdx.x".

I know for certain that, in this phase of the project, the function always returns 1.012086. I've written this code to do the sum:

currentErrors[threadIdx.x]=0;
for(i=0;i<gridDim.x;i++)
{
    if(i==blockIdx.x)
    {
        currentErrors[threadIdx.x]+=globalError(mynet,myoutput);
    }
}

But when the kernel ends, every position of the array holds the value 1.012086 (instead of 1.012086*2100).

Where am I going wrong? Thanks for your help!

1
This question and the previous questions you have asked strongly indicate that you are unclear on the very basics of CUDA. I think you should take a few steps back and read an introductory book on CUDA before attempting to continue with your project. – Roger Dahl
Before posting this question I had already read this guide about reduction, but I didn't find anything for my specific case: developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/… There is no info there about reduction between CUDA blocks. – Andrea Sylar Solla
Andrea, that means that you didn't understand the paper. Follow my advice and all will become clear :) – Roger Dahl

1 Answer

2
votes

To compute a final sum out of partial results of your blocks, I would suggest doing it the following way:

  • Let every block write a partial result into a separate cell of a gridDim.x-sized array.
  • Copy the array to host.
  • Perform final sum on the host.
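The steps above can be sketched roughly like this. Note that the kernel, variable names, and the hard-coded 1.012086 stand in for your globalError computation; this is an illustrative sketch, not your actual code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block writes ONE partial result into its own cell,
// indexed by blockIdx.x (not threadIdx.x).
__global__ void partialKernel(float *partialResults)
{
    // Hypothetical per-block work; here every block just produces
    // the constant from the question.
    if (threadIdx.x == 0)
        partialResults[blockIdx.x] = 1.012086f;
}

int main()
{
    const int numBlocks = 2100;
    float *d_partial;
    cudaMalloc(&d_partial, numBlocks * sizeof(float));

    partialKernel<<<numBlocks, 4>>>(d_partial);

    // Copy the gridDim.x-sized array back to the host...
    static float h_partial[numBlocks];
    cudaMemcpy(h_partial, d_partial, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    // ...and perform the final sum on the host.
    double total = 0.0;
    for (int i = 0; i < numBlocks; ++i)
        total += h_partial[i];

    printf("total = %f\n", total); // expect roughly 1.012086 * 2100
    cudaFree(d_partial);
    return 0;
}
```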

I assume each block has a lot to compute on its own, which would warrant the usage of CUDA in the first place.

As for your current code --- I think there is something wrong in your kernel. It seems that every block writes a single value, returning a final result as if it were a partial result.

The loop you presented does not really make sense: for each block, there is only one value of i that does anything. The code is equivalent to simply writing:

currentErrors[threadIdx.x]=0;
currentErrors[threadIdx.x]+=globalError(mynet,myoutput);

save for some unpredictable scheduling differences.

Remember that blocks are not executed in sync. Each block can run before, during or after any other block.
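If you do want to accumulate across blocks on the device, one common alternative is an atomic add into a single global accumulator. This is a sketch, assuming a device of compute capability 2.0 or later (required for atomicAdd on float), and the constant again stands in for your globalError call:

```cuda
// Every block contributes its partial error to one global total.
// *totalError must be zeroed (e.g. with cudaMemset) before the launch.
__global__ void accumulateError(float *totalError)
{
    float partial = 1.012086f; // stands in for globalError(mynet, myoutput)
    if (threadIdx.x == 0)
        atomicAdd(totalError, partial); // one contribution per block
}
```

Be aware that atomics serialize contending blocks and that floating-point addition order becomes nondeterministic, so the per-block-array-plus-host-sum approach above is usually the simpler starting point.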


Also: