I am working on large datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads. Each thread stores its 6 intermediate results in shared memory (6 float arrays). However, I now need to add up each of the intermediate results to produce a sum (6 sum values in total). I do not have enough global memory to write these 6 float arrays out and then run a reduction from Thrust or any other library from the host code.
Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?
What would be the best way to solve this problem? I am new to CUDA programming and would welcome any suggestions.
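To make the question concrete, here is a minimal sketch of the kind of in-kernel reduction I have in mind: a hand-written tree reduction over the 6 shared-memory arrays, with one `atomicAdd` per block and per result so that only 6 floats ever live in global memory. It is not a definitive implementation. The names `compute_intermediates`, `block_sums`, `run_block_sums`, and `totals` are hypothetical, the per-element computation is a placeholder, and the sketch assumes the block size is a power of two (1024 here) and that float `atomicAdd` is available on the device. CUB's `cub::BlockReduce` appears to offer a library version of the same block-level step, but I have not tried it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 1024;  // block size from the question (power of two)
constexpr int kResults = 6;     // intermediate results per data element

// Hypothetical stand-in for the real per-element computation.
__device__ void compute_intermediates(float x, float out[kResults]) {
    for (int k = 0; k < kResults; ++k) out[k] = x * (k + 1);
}

// Each block reduces its 6 partial results in shared memory, then adds
// them to 6 running totals in global memory.
__global__ void block_sums(const float* data, int n, float* totals) {
    __shared__ float s[kResults][kThreads];  // 6 * 1024 * 4 B = 24 KB
    const int tid = threadIdx.x;

    // Grid-stride loop: each thread accumulates privately in registers.
    float acc[kResults] = {0.0f};
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x) {
        float r[kResults];
        compute_intermediates(data[i], r);
        for (int k = 0; k < kResults; ++k) acc[k] += r[k];
    }
    for (int k = 0; k < kResults; ++k) s[k][tid] = acc[k];
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            for (int k = 0; k < kResults; ++k)
                s[k][tid] += s[k][tid + stride];
        }
        __syncthreads();  // outside the branch so every thread reaches it
    }

    // Thread 0 publishes this block's 6 sums.
    if (tid == 0) {
        for (int k = 0; k < kResults; ++k) atomicAdd(&totals[k], s[k][0]);
    }
}

// Host helper (error checking omitted for brevity).
void run_block_sums(const std::vector<float>& h, float out[kResults]) {
    const int n = static_cast<int>(h.size());
    float* d_data = nullptr;
    float* d_totals = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_totals, kResults * sizeof(float));
    cudaMemcpy(d_data, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_totals, 0, kResults * sizeof(float));

    int blocks = (n + kThreads - 1) / kThreads;
    if (blocks > 64) blocks = 64;  // arbitrary cap; the grid-stride loop covers the rest
    if (blocks < 1) blocks = 1;
    block_sums<<<blocks, kThreads>>>(d_data, n, d_totals);

    cudaMemcpy(out, d_totals, kResults * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_totals);
}

int main() {
    std::vector<float> h(450 * 450, 1.0f);  // one 450x450 slice filled with ones
    float out[kResults];
    run_block_sums(h, out);
    for (int k = 0; k < kResults; ++k) printf("sum[%d] = %.0f\n", k, out[k]);
    return 0;
}
```

Accumulating in registers first means shared memory is only touched once per thread before the reduction, and the input could be fed slice by slice if the whole cube does not fit on the device. Note that floating-point sums computed this way can differ slightly from a sequential sum, because the order of additions changes.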
image[1500][450][450] or pixel after pixel like image[450][450][1500] - kangshiyin