I have a kernel that produces an array of result values, and I want to find the maximum of these values efficiently. The array is initialized at the beginning of the kernel with a negative value (for example, -1). The kernel is launched with, for example, 5 blocks of 256 threads each.
Here are the problems:
Because of my data, I must terminate threads that are not valid, so a block sometimes works with 256 threads, sometimes with 50, 20, and so on.
Each block writes its results to shared memory, but as I mentioned, some blocks produce 50 results and some produce 256, so the shared array looks like this: 8, 6, 4, 9, 1, -1, -1, -1, ...
In that case, how can I efficiently find the maximum within one block?
A parallel reduction seems complicated on this kind of array, doesn't it? How should I do this?
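To make the setup concrete, here is a minimal sketch of one approach that might work. It assumes the results are integers and that every valid result is greater than the -1 fill value, so a fill value can never win a max comparison. It also assumes invalid threads can skip their work instead of returning early, so that all 256 threads still reach every `__syncthreads()`. The names `blockMaxKernel`, `blockMaxima`, `validCount`, and `SENTINEL` are hypothetical and only for illustration; the input array stands in for whatever the real kernel computes.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 256; // threads per block, as in the question
constexpr int SENTINEL = -1;    // assumed to be lower than every valid result

// Hypothetical kernel: block b writes its maximum to blockMax[b].
// validCount[b] is the assumed number of valid threads in block b.
__global__ void blockMaxKernel(const int* in, const int* validCount, int* blockMax)
{
    __shared__ int s[BLOCK_SIZE];
    const int tid = threadIdx.x;

    // Invalid threads store the sentinel instead of returning early, so
    // every thread still takes part in the barriers below.
    const bool valid = tid < validCount[blockIdx.x];
    s[tid] = valid ? in[blockIdx.x * BLOCK_SIZE + tid] : SENTINEL;
    __syncthreads();

    // Ordinary tree reduction over the whole array; sentinel slots lose
    // every comparison against a valid result.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = max(s[tid], s[tid + stride]);
        __syncthreads();
    }
    if (tid == 0)
        blockMax[blockIdx.x] = s[0];
}

// Hypothetical host helper: one block per entry of validCount.
// Error checking of the CUDA calls is omitted to keep the sketch short.
std::vector<int> blockMaxima(const std::vector<int>& in, const std::vector<int>& validCount)
{
    const int numBlocks = static_cast<int>(validCount.size());
    assert(in.size() == static_cast<size_t>(numBlocks) * BLOCK_SIZE);

    int *dIn = nullptr, *dValid = nullptr, *dMax = nullptr;
    cudaMalloc(&dIn, in.size() * sizeof(int));
    cudaMalloc(&dValid, numBlocks * sizeof(int));
    cudaMalloc(&dMax, numBlocks * sizeof(int));
    cudaMemcpy(dIn, in.data(), in.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dValid, validCount.data(), numBlocks * sizeof(int), cudaMemcpyHostToDevice);

    blockMaxKernel<<<numBlocks, BLOCK_SIZE>>>(dIn, dValid, dMax);

    std::vector<int> out(numBlocks);
    cudaMemcpy(out.data(), dMax, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dValid);
    cudaFree(dMax);
    return out;
}

int main()
{
    // 5 blocks of 256 threads with different numbers of valid threads.
    const std::vector<int> validCount = {256, 50, 20, 5, 1};
    std::vector<int> in(validCount.size() * BLOCK_SIZE);
    for (size_t i = 0; i < in.size(); ++i)
        in[i] = static_cast<int>((i * 7) % 100); // made-up non-negative data

    const std::vector<int> result = blockMaxima(in, validCount);
    for (size_t b = 0; b < result.size(); ++b)
        printf("block %zu max = %d\n", b, result[b]);
    return 0;
}
```

With this layout the reduction does not need to know how many results are valid, because the unused slots behave like an identity element for max. A block with no valid threads would simply report -1. Is this a reasonable way to do it, or is there something more efficient?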