Hello, I am new to CUDA programming and I have run into a problem.
I have a variable, let's call it foo, stored in the shared memory of each block, and its value differs from one block to another. I want a single thread to sum these values across all blocks. My idea was to write each block's foo to global memory and then compute the sum there, but is there any function that can do this more quickly?
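To make the question concrete, here is a rough sketch of the global-memory approach I have in mind. The kernel names (writeFoo, sumFoo), the array blockFoo, the launch sizes, and the placeholder value assigned to foo are all made up for illustration; my real code computes foo differently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block keeps its own foo in shared memory, then copies it to global memory.
__global__ void writeFoo(int *blockFoo)
{
    __shared__ int foo;
    if (threadIdx.x == 0) {
        foo = blockIdx.x + 1;  // placeholder value, stands in for the real computation
    }
    __syncthreads();
    if (threadIdx.x == 0) {
        blockFoo[blockIdx.x] = foo;  // one slot per block
    }
}

// A single thread sums the per-block values.
__global__ void sumFoo(const int *blockFoo, int numBlocks, int *total)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < numBlocks; ++i) {
            sum += blockFoo[i];
        }
        *total = sum;
    }
}

int main()
{
    const int numBlocks = 8;         // assumed launch configuration
    const int threadsPerBlock = 64;  // assumed launch configuration

    int *dBlockFoo = nullptr;
    int *dTotal = nullptr;
    cudaMalloc(&dBlockFoo, numBlocks * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));

    // Both kernels run on the default stream, so sumFoo starts only after
    // writeFoo has finished and its writes are visible.
    writeFoo<<<numBlocks, threadsPerBlock>>>(dBlockFoo);
    sumFoo<<<1, 1>>>(dBlockFoo, numBlocks, dTotal);

    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", total);

    cudaFree(dBlockFoo);
    cudaFree(dTotal);
    return 0;
}
```

This works as far as I can tell, but it needs a second kernel launch and a serial loop, which is why I am asking whether something faster exists.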
Thanks for your help.