CUDA sum across blocks

Question

Hello, I am new to CUDA programming and I have run into a problem.

I have a variable, call it foo, stored in the shared memory of each block, and its value differs from one block to another. I want a single thread to sum these values across all blocks. I thought of writing foo to global memory and then computing the sum there, but is there a function that can do this more quickly?

Thanks for your help.

einpoklum · Accepted Answer · 2018-11-01T17:34:19

Rather than writing every foo to global memory and summing them in a separate step, it would be faster to have one thread in each block perform an atomicAdd() operation, adding the per-block value to a single, grid-wide variable in global memory.

See the relevant section (on atomic functions) of the CUDA C Programming Guide.
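As a minimal sketch of the atomicAdd() approach, something like the following could work. The kernel name sum_foo_across_blocks, the host helper sum_across_blocks, and the placeholder value assigned to foo are all made up for illustration and are not from the question; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Each block keeps its own value `foo` in shared memory. Thread 0 of every
// block then adds that value to one grid-wide accumulator in global memory.
__global__ void sum_foo_across_blocks(unsigned long long *grid_sum)
{
    __shared__ unsigned long long foo;

    if (threadIdx.x == 0) {
        // Placeholder (assumption): stands in for whatever the block
        // actually computes. Here block b simply holds the value b + 1.
        foo = blockIdx.x + 1;
    }
    __syncthreads();  // foo is now visible to every thread in the block

    if (threadIdx.x == 0) {
        // One atomic update per block, so contention stays low.
        atomicAdd(grid_sum, foo);
    }
}

// Hypothetical host helper: launches the kernel and returns the total.
unsigned long long sum_across_blocks(int num_blocks, int threads_per_block)
{
    unsigned long long *d_sum = nullptr;
    unsigned long long h_sum = 0;

    cudaMalloc(&d_sum, sizeof(*d_sum));
    cudaMemset(d_sum, 0, sizeof(*d_sum));  // accumulator must start at zero

    sum_foo_across_blocks<<<num_blocks, threads_per_block>>>(d_sum);

    // A blocking cudaMemcpy on the default stream waits for the kernel,
    // so every block has contributed by the time the copy happens.
    cudaMemcpy(&h_sum, d_sum, sizeof(h_sum), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return h_sum;
}
```

With the placeholder values above, sum_across_blocks(4, 32) should return 1 + 2 + 3 + 4 = 10. Note that the accumulator only holds the complete total once every block has performed its atomicAdd(), which is guaranteed after the kernel has finished; a thread inside the same kernel cannot safely read the final sum without additional grid-wide synchronization.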

For a deeper exploration of optimizing reductions (= summation), albeit not necessarily the one you want to perform, have a look at Mark Harris' presentation: Optimizing Parallel Reduction in CUDA.
