0
votes

I have a kernel which produces an array of result values and I want to find the maximum of these values efficiently. The Array is initialized in the beginning of the kernel with some negative value (for example -1). The Kernel is executed using 5 blocks each with 256 threads, for example.

Here are the problems:

  1. Because of my data, i must terminate threads, that are not valid, so I am working sometimes with 256 threads, sometimes 50, 20 and so on.

  2. In shared memory are written results from block, but as I mentioned, some array has 50 results, some has 256 results...(so shared array looks like this) 8,6,4,9,1,-1,-1,-1...

  3. In that case how to efficiently find the maximum in one block ?

Parallel reduction will be complicated on these types of array, isn't it ? How to do this ?

2
What do you mean with 'terminate' threads? Do a 'return' for that thread? Because that is not recommendablepQB
Have you considered using one of the free libraries available for finding the max value (e.g. ArrayFire)?arrayfire

2 Answers

2
votes

There is not enough information about your algorithm.

What do you mean by n results? Are the ignored values in the array set to -1 or do you use dynamic extern shared memory and threads only write up to index n (sounds hard to realize)?

  • Use a fixed size shared memory and set ignored values to -1 and still perform parallel reduction, if you search positive maximum and its filled with -1 it doesn't matter, or

  • Don't terminate threads, instead set a flag in another array if thread should not perform further calculations and still do parallel reduction

0
votes

You can either do the reduction within your kernel (as djmj said) or you could use thrust to combine the functor and reduction (e.g. transform_reduce). Thrust is included with the CUDA Toolkit, see this page for an example of transform_reduce.