4
votes

Is there a way to intersect a single ray with a SIMD-pack of 8 triangles such that I don't have to use store or shuffle or any such slow instructions? My main issue is the final part of the intersection where I find which of the 8 triangles in the pack is nearest to the ray; I'm storing and then getting the minimum t-value, basically a horizontal min.

Moreover, is this pattern correct? I'm using an 8 way BVH with single ray traversal as described in the "Stackless Multi-BVH Traversal for CPU, MIC and GPU Ray Tracing" paper on top of which I've added single ray to bundles of triangles intersection. Would ray-bundle intesected with single-triangle, coupled with binary BVH be more suitable?

Thank you.

2
That horizontal minimum can be done with some shuffles and vertical minimums, it's a bit annoying but not too bad. If you post the current code, I can see if I can edit it in.harold
@harold, the thing is I'm trying to avoid shuffles and stores as I've read they're pretty slow.user2199797
They're slower than not doing them, but there is no instruction for horizontal minimum (except for uint16's) so there aren't that many options..harold

2 Answers

1
votes

The difficulty with bundles of rays is how do you get high coherency of rays.

With primary rays, its not too bad since neighboring pixels will generate highly coherent rays. With shadow rays, it'll be a little complicated but doable with point lights or relatively small area lights.

Any other secondary rays become a pain to deal with. If you randomly choose rays, then they will easily all hit different triangles meaning you end up doing the equivalent single ray intersections, but with SIMD overhead. From what I've seen using SIMD ray bundles has not yet shown good promise, even when rearranging ray bundles to try to achieve good coherency.

If you do want to try it one possible paper to look at is "Large Ray Packets for Real Time Whitted Ray Tracing". They look at ray tracing using ray packets, and have some references for other ray packet techniques.

0
votes

You basically have to do this horizontal reduction, using for instance one _mm256_min_ps, one _mm_min_ps, and finally two _mm_min_ss with a few shuffles to put the results in the right lanes. This is already better than doing 8 _mm_min_ss with 8 shuffles.

The other option is to keep the register containing the 8 t parameters and do the reduction only at the end of the traversal loop. But that means loosing the ability to cull BVH nodes precisely. Therefore I would recommend the first approach.