2
votes

I am working on large datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads, and each thread stores its 6 intermediate results in shared memory (6 float arrays). However, I now need to add up the intermediate results across threads to produce a sum (6 sum values). I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from Thrust or any other library from the host code.

Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?

What would be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.

4
By '6 sum values', do you mean your final result contains only 6 floats, or that you will do a reduction on only 6 floats 450x450x1500 times? - kangshiyin
The final result contains 6 floats. The sum is over the third dimension (1500). So I finally need to end up with 450x450x6 floats. - user2789280
So how is your image cube stored in global memory? Frame after frame like image[1500][450][450], or pixel after pixel like image[450][450][1500]? - kangshiyin
It is stored first along z, then along x and y, so image[1500][450][450]. I have a thread processing each voxel. Since a block cannot have 1500 threads, I am using 512 threads per block and splitting the 1500 into three blocks. I will need to eventually accumulate results from all three blocks. I am thinking that I will use temporary global memory (450x450x6) to save the intermediate sum values from each block. Is that a good way to do this? - user2789280
You don't need to sum one 1500-D vector using multiple blocks or multiple threads. Using one thread is enough for your case. See my updated answer. - kangshiyin

4 Answers

2
votes

This seems unlikely:

I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.

I can't imagine how you have enough space to store your data in shared memory but not in global memory.

Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
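For example, a minimal sketch using cub::BlockReduce, assuming a 1024-thread block and six per-thread partials (the names partial and block_sums are illustrative, not from your code):

```cuda
#include <cub/cub.cuh>

__global__ void reduceSix(float *block_sums)  // 6 floats per block
{
    typedef cub::BlockReduce<float, 1024> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    float partial[6];
    // ... compute this thread's 6 intermediate results into partial[0..5] ...
    for (int j = 0; j < 6; ++j) partial[j] = 0.0f;  // placeholder

    for (int j = 0; j < 6; ++j) {
        // The reduced value is only valid in thread 0.
        float sum = BlockReduce(temp_storage).Sum(partial[j]);
        if (threadIdx.x == 0)
            block_sums[6 * blockIdx.x + j] = sum;
        __syncthreads();  // required before temp_storage is reused
    }
}
```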

Or you can write your own sum-reduction code. It's not terribly hard to do; there are many questions on SO about it, such as this one.
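The classic pattern looks roughly like this (a sketch, assuming the block size is a power of two; in and out are illustrative names):

```cuda
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float sdata[1024];  // one float per thread
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];  // this block's sum
}
```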

Or you could adapt the CUDA parallel reduction sample code.

1
vote

Update

After seeing all the comments, I understand that instead of doing one reduction or a few reductions, you need to do 450x450x6 of them.

In this case there's a simpler solution.

You don't need to implement a relatively complex parallel reduction for each 1500-D vector. Since you already have 450x450x6 vectors to reduce, you can reduce all of these vectors in parallel, each with the traditional serial reduction method.

You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.

In each thread, you could iterate over the 1500 frames. In each iteration, you could first compute the 6 intermediate results, then add them to the sums. When you finish all the iterations, you could write the 6 sums to global memory.

That finishes the kernel design, and no shared memory is needed.
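A sketch of that design could look like the following; the frame-major layout (image[z][y][x], as discussed in the comments) and all names are assumptions, and the 6 per-voxel results are application-specific placeholders:

```cuda
__global__ void sumOverFrames(const float *image, float *out,
                              int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;  // guard the edges of the 450x450 image

    float sums[6] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f};
    for (int z = 0; z < nz; ++z) {
        float v = image[((size_t)z * ny + y) * nx + x];
        float r[6];
        // ... compute the 6 intermediate results r[0..5] from v ...
        for (int j = 0; j < 6; ++j) r[j] = v;  // placeholder computation
        for (int j = 0; j < 6; ++j) sums[j] += r[j];
    }

    size_t p = ((size_t)y * nx + x) * 6;  // 6 sums per pixel
    for (int j = 0; j < 6; ++j) out[p + j] = sums[j];
}
```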

You will find that the performance is very good. Since it is a memory-bound operation, it won't take much longer than simply accessing all the image cube data once.

In case you don't have enough global memory for the whole cube, you could split it into 4 sub-cubes of [1500][225][225] and call the kernel routine on each sub-cube. The only thing you need to change is the grid size.
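For example, a possible host-side launch for the sketch kernel above (names assumed):

```cuda
dim3 block(16, 16);
dim3 grid((nx + block.x - 1) / block.x,   // 29x29 for the full 450x450 image,
          (ny + block.y - 1) / block.y);  // 15x15 for a 225x225 sub-cube
sumOverFrames<<<grid, block>>>(d_image, d_out, nx, ny, 1500);
```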

0
votes

Have a look at this, which explains parallel reduction in CUDA thoroughly.

0
votes

If I understand it correctly, each thread should sum up "only" 6 floats.

I'm not sure it is worth doing that with a parallel reduction, in the sense that you would actually see performance gains.

If you are targeting a Kepler GPU, you could try shuffle operations, provided you set the block size properly so that your intermediate results fit in the Streaming Multiprocessor's registers in some way.
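For instance, a warp-level sum via shuffle might look like this (a sketch; __shfl_down_sync is the CUDA 9+ form, while Kepler-era toolkits used __shfl_down without the mask argument):

```cuda
__inline__ __device__ float warpReduceSum(float val)
{
    // Each step folds the upper half of the warp onto the lower half.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 now holds the warp's sum
}
```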

As Robert Crovella also pointed out, your statement about not being able to store the intermediate results seems strange, as the amount of global memory is certainly larger than the amount of shared memory.