0
votes

Hello, I am new to CUDA programming and I have a problem.

I have a variable, let's call it foo, stored in the shared memory of each block, with a different value in each block. I want a single thread to sum these values across all blocks. I thought of writing foo to global memory and then computing the sum, but is there a function that can do this more quickly?

Thanks for your help.


1 Answer

2
votes

Rather than storing every block's value and summing them in a separate step, it would be faster to have one thread in each block perform an atomicAdd() operation, adding the per-block value to a single grid-wide variable in global memory.

See the section on atomic functions in the CUDA C Programming Guide.
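As a rough illustration of the atomicAdd() approach, here is a minimal sketch. The names accumulateFoo and sumAcrossBlocks are made up for this example, and the value assigned to foo (blockIdx.x + 1) is only a placeholder for whatever your block actually computes. It assumes foo is an int; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block holds a per-block value `foo` in shared
// memory, and thread 0 of each block adds it to one grid-wide total.
__global__ void accumulateFoo(int *total)
{
    __shared__ int foo;

    if (threadIdx.x == 0) {
        // Placeholder value (assumption): stands in for the real per-block result.
        foo = blockIdx.x + 1;
    }
    __syncthreads();  // make foo visible to every thread in the block

    // ... the rest of the block's work, which may read foo, would go here ...

    if (threadIdx.x == 0) {
        // Exactly one atomic update per block on the global accumulator.
        atomicAdd(total, foo);
    }
}

// Hypothetical host helper: launches the kernel and returns the summed value.
int sumAcrossBlocks(int numBlocks, int threadsPerBlock)
{
    int *dTotal = nullptr;
    int hTotal = 0;

    cudaMalloc(&dTotal, sizeof(int));
    cudaMemset(dTotal, 0, sizeof(int));  // the accumulator must start at zero

    accumulateFoo<<<numBlocks, threadsPerBlock>>>(dTotal);

    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&hTotal, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dTotal);
    return hTotal;
}
```

With the placeholder values above, sumAcrossBlocks(8, 128) adds 1 + 2 + ... + 8. Note that the total is only guaranteed to be complete once the kernel has finished, since blocks run in no particular order; a thread inside the same kernel cannot simply read the accumulator and assume every block has already contributed.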

For a deeper exploration of optimizing reductions (= summation), albeit not necessarily the one you want to perform, have a look at Mark Harris' presentation: Optimizing Parallel Reduction in CUDA.