2
votes

I am working on large datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads, and each thread stores its 6 intermediate results in shared memory (6 float arrays). However, I now need to add up the intermediate results across threads to produce a sum (6 sum values). I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from Thrust or any other library from the host code.

Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?

What would be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.

4
By '6 sum values', do you mean your final result contains only 6 floats, or that you will do a reduction on only 6 floats 450x450x1500 times? - kangshiyin
The final result contains 6 floats. The sum is over the third dimension (1500). So I finally need to end up with 450x450x6 floats. - user2789280
So how is your image cube stored in global memory? Frame after frame like image[1500][450][450], or pixel after pixel like image[450][450][1500]? - kangshiyin
It is stored first along z, then along x and y, so image[1500][450][450]. I have a thread processing each voxel. Since a block cannot have 1500 threads, I am using 512 threads per block and splitting the 1500 into three blocks. I will need to eventually accumulate results from all three blocks. I am thinking that I will use temporary global memory (450x450x6) to save the intermediate sum values from each block. Is that a good way to do this? - user2789280
You don't need to sum one 1500-D vector using multiple blocks or multiple threads. Using one thread is enough for your case. See my updated answer. - kangshiyin

4 Answers

2
votes

This seems unlikely:

I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.

I can't imagine how you have enough space to store your data in shared memory but not in global memory.

Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
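For example, a minimal sketch using cub::BlockReduce, assuming a 1024-thread block and six per-thread partials (the names partial and block_sums are illustrative, not from your code):

```cuda
#include <cub/cub.cuh>

__global__ void reduceSix(float *block_sums)  // 6 floats per block
{
    typedef cub::BlockReduce<float, 1024> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    float partial[6];
    // ... compute this thread's 6 intermediate results into partial[0..5] ...
    for (int j = 0; j < 6; ++j) partial[j] = 0.0f;  // placeholder

    for (int j = 0; j < 6; ++j) {
        // The reduced value is only valid in thread 0.
        float sum = BlockReduce(temp_storage).Sum(partial[j]);
        if (threadIdx.x == 0)
            block_sums[6 * blockIdx.x + j] = sum;
        __syncthreads();  // required before temp_storage is reused
    }
}
```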

Or you can write your own sum-reduction code. It's not terribly hard to do; there are many questions on SO about it, such as this one.
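The classic pattern looks roughly like this (a sketch, assuming the block size is a power of two; in and out are illustrative names):

```cuda
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float sdata[1024];  // one float per thread
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];  // this block's sum
}
```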

Or you could adapt the CUDA parallel reduction sample code.

1
vote

Update

After seeing all the comments, I understand that instead of doing one reduction or a few reductions, you need to do 450x450x6 of them.

In this case there's a simpler solution.

You don't need to implement a relatively complex parallel reduction for each 1500-D vector. Since you already have 450x450x6 vectors to reduce, you can reduce all of these vectors in parallel, each with the traditional serial reduction method.

You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.

In each thread, you could iterate over the 1500 frames. In each iteration, you could first compute the 6 intermediate results, then add them to the sums. When you finish all the iterations, you could write the 6 sums to global memory.

That finishes the kernel design, and no shared memory is needed.
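A sketch of that design could look like the following; the frame-major layout (image[z][y][x], as discussed in the comments) and all names are assumptions, and the 6 per-voxel results are application-specific placeholders:

```cuda
__global__ void sumOverFrames(const float *image, float *out,
                              int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;  // guard the edges of the 450x450 image

    float sums[6] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f};
    for (int z = 0; z < nz; ++z) {
        float v = image[((size_t)z * ny + y) * nx + x];
        float r[6];
        // ... compute the 6 intermediate results r[0..5] from v ...
        for (int j = 0; j < 6; ++j) r[j] = v;  // placeholder computation
        for (int j = 0; j < 6; ++j) sums[j] += r[j];
    }

    size_t p = ((size_t)y * nx + x) * 6;  // 6 sums per pixel
    for (int j = 0; j < 6; ++j) out[p + j] = sums[j];
}
```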

You will find that the performance is very good. Since it is a memory-bound operation, it won't take much longer than simply accessing all the image cube data once.

In case you don't have enough global memory for the whole cube, you could split it into 4 sub-cubes of [1500][225][225] and call the kernel routine on each sub-cube. The only thing you need to change is the grid size.
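For example, a possible host-side launch for the sketch kernel above (names assumed):

```cuda
dim3 block(16, 16);
dim3 grid((nx + block.x - 1) / block.x,   // 29x29 for the full 450x450 image,
          (ny + block.y - 1) / block.y);  // 15x15 for a 225x225 sub-cube
sumOverFrames<<<grid, block>>>(d_image, d_out, nx, ny, 1500);
```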

0
votes

Have a look at this, which explains parallel reduction in CUDA thoroughly.

0
votes

If I understand it correctly, each thread should sum up "only" 6 floats.

I'm not sure it is worth doing that with a parallel reduction, in the sense that you would actually see performance gains.

If you are targeting a Kepler GPU, you could try shuffle operations, provided you set the block size properly so that your intermediate results fit in the Streaming Multiprocessor's registers in some way.
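For instance, a warp-level sum via shuffle might look like this (a sketch; __shfl_down_sync is the CUDA 9+ form, while Kepler-era toolkits used __shfl_down without the mask argument):

```cuda
__inline__ __device__ float warpReduceSum(float val)
{
    // Each step folds the upper half of the warp onto the lower half.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 now holds the warp's sum
}
```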

As Robert Crovella also pointed out, your statement about not being able to store the intermediate results seems strange, as the amount of global memory is certainly larger than the amount of shared memory.