
I am developing a Monte Carlo ray tracer in OpenCL for calculating 'view factors' for radiative heat transfer analysis, and I would like to know the best way to collate the number of times object x is intersected by rays fired from object i.

The basic algorithm is as follows:

  1. Fire a random ray, R, from the surface of object i
  2. Test intersection of ray R with objects 0 - N
  3. Determine the first object intersected by R, let this be object x
  4. Record the first intersection by incrementing an array of ints such that array[i][x] += 1
  5. Repeat for the total number of rays
  6. Divide each value in the ith row of the array by the total number of rays fired from object i (see the sketch after this list).
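For reference, here is a minimal single-threaded sketch of steps 1-6 in plain C; the ray generation and first-hit test are collapsed into fire_ray_and_hit, a hypothetical stand-in for the actual geometry code:

    #include <stdlib.h>

    /* Hypothetical stand-in for "fire a random ray from object i and return the
       index of the first object it hits, or -1 on a miss"; it just picks a
       pseudo-random index so the tallying logic can be compiled and run. */
    int fire_ray_and_hit(int i, int num_objects)
    {
        (void)i;
        int x = rand() % (num_objects + 1);   /* treat one extra value as a miss */
        return (x == num_objects) ? -1 : x;
    }

    /* Steps 1-6 for a single source object i: tally first hits, then divide by
       the number of rays fired to get the view-factor estimates for row i. */
    void estimate_view_factors(int i, int num_objects, long num_rays,
                               double *view_factor /* length num_objects */)
    {
        long *hits = calloc((size_t)num_objects, sizeof *hits);

        for (long r = 0; r < num_rays; ++r) {
            int x = fire_ray_and_hit(i, num_objects);   /* steps 1-3 */
            if (x >= 0)
                hits[x] += 1;                           /* step 4 */
        }

        for (int x = 0; x < num_objects; ++x)           /* step 6 */
            view_factor[x] = (double)hits[x] / (double)num_rays;

        free(hits);
    }

Here hits[x] plays the role of array[i][x], since the sketch handles one source object i at a time.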

Typically, in a parallel implementation on the CPU, each thread would maintain its own copy of array[N], and once all rays have been fired from object i the master thread would sum the individual arrays to get the results.

In OpenCL on the GPU this is not a practical solution: as N increases there is quickly a shortage of local memory, and using a single shared array with barriers would cripple performance.

What is the best practice for performing the reduction of the results array, or is a memory barrier the only practical solution?


1 Answer


Kernel A) Do it as you described, with the threads writing their own copies of array[N] back to global memory at the end.
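A hedged OpenCL C sketch of one way kernel A could look; note that it keeps one tally per work-group in local memory (rather than per work-item) and uses local atomics, the first-hit test is replaced by a toy hash so the kernel compiles, and the buffer names and arguments are illustrative:

    /* One tally array per work-group, kept in local memory and flushed to a
       per-group slice of a global buffer at the end (OpenCL 1.1+ atomics). */
    __kernel void trace_and_tally(const int num_objects,
                                  const int rays_per_item,
                                  __global uint *partial_hits, /* num_groups * num_objects */
                                  __local  uint *group_hits)   /* num_objects */
    {
        const int lid   = (int)get_local_id(0);
        const int lsize = (int)get_local_size(0);
        const int group = (int)get_group_id(0);

        /* Zero this work-group's tally. */
        for (int x = lid; x < num_objects; x += lsize)
            group_hits[x] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Each work-item fires its share of rays. */
        uint seed = (uint)get_global_id(0) * 2654435761u + 1u;
        for (int r = 0; r < rays_per_item; ++r) {
            seed = seed * 1664525u + 1013904223u;   /* toy LCG: placeholder for the
                                                       real ray + first-hit test */
            int x = (int)(seed % (uint)num_objects);
            atomic_inc(&group_hits[x]);             /* local-memory atomic increment */
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Write this group's partial tally to global memory. */
        for (int x = lid; x < num_objects; x += lsize)
            partial_hits[group * num_objects + x] = group_hits[x];
    }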

Kernel B) A kernel that performs the reduction (the summing that the "master thread" does in your description).
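Assuming the partial_hits layout from the sketch above (one row of num_objects counters per work-group), kernel B could be as simple as one work-item per object index:

    /* Sum the per-work-group partial tallies into one total per object.
       Launch with a global size of at least num_objects. */
    __kernel void reduce_tallies(const int num_objects,
                                 const int num_groups,
                                 __global const uint *partial_hits, /* num_groups * num_objects */
                                 __global uint *total_hits)         /* num_objects */
    {
        const int x = (int)get_global_id(0);
        if (x >= num_objects)
            return;

        uint sum = 0;
        for (int g = 0; g < num_groups; ++g)
            sum += partial_hits[g * num_objects + x];

        total_hits[x] = sum;
    }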

Make all the invocations of kernel A set events upon completion, and make the invocation(s) of kernel B depend on those events. You could even use multiple invocations of kernel B to do a pyramid reduction, each taking two (or more; tuning may be needed) partial arrays and writing out one sum. A chain of events would resolve the dependencies and allow deep queuing.
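On the host side the event chaining could look roughly like this; the kernel handles, sizes and NUM_BATCHES are illustrative, and clSetKernelArg calls and error checking are omitted:

    #include <CL/cl.h>

    #define NUM_BATCHES 8   /* arbitrary number of kernel-A launches */

    /* Enqueue NUM_BATCHES runs of kernel A, then one run of kernel B that
       waits on all of them through an event wait list. */
    cl_event enqueue_trace_then_reduce(cl_command_queue queue,
                                       cl_kernel kernel_a, cl_kernel kernel_b,
                                       size_t trace_global, size_t trace_local,
                                       size_t reduce_global)
    {
        cl_event a_done[NUM_BATCHES];

        for (int b = 0; b < NUM_BATCHES; ++b)
            clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL,
                                   &trace_global, &trace_local,
                                   0, NULL, &a_done[b]);

        /* Kernel B starts only after every kernel-A batch has completed. */
        cl_event b_done;
        clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL,
                               &reduce_global, NULL,
                               NUM_BATCHES, a_done, &b_done);

        for (int b = 0; b < NUM_BATCHES; ++b)
            clReleaseEvent(a_done[b]);

        return b_done;
    }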