It is possible to do up to 16 triangles at a time with AVX512, 8 with AVX2, and 4 with SSE. The trick, though, is to ensure the data is in SOA (structure-of-arrays) format. The other trick is to never 'return false' at any point: compute every lane, then filter the results at the end. So your triangle input would look something like:
struct Tri {
    __m256 e1[3];
    __m256 e2[3];
    __m256 v0[3];
};
And your ray would look like:
struct Ray {
    __m256 dir[3];
    __m256 pos[3];
};
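To make the SOA layout concrete, here is a minimal sketch of filling those two structs from ordinary scalar data. Everything except `Tri` and `Ray` is made up for the illustration (`ScalarTri`, `ScalarRay`, `packTriangles`, `broadcastRay`, `lane`, and the `AVX2_FN` macro, which only exists so the snippet builds on GCC/Clang without `-mavx2 -mfma`). It also assumes e1 and e2 are the edges v1 - v0 and v2 - v0, which may not match your own code.

```cpp
#include <immintrin.h>
#include <cassert>

#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2,fma")))
#else
#define AVX2_FN
#endif

struct Tri { __m256 e1[3]; __m256 e2[3]; __m256 v0[3]; };
struct Ray { __m256 dir[3]; __m256 pos[3]; };

// Hypothetical scalar (AOS) inputs, one triangle / one ray each.
struct ScalarTri { float v0[3]; float v1[3]; float v2[3]; };
struct ScalarRay { float pos[3]; float dir[3]; };

// Transpose 8 scalar triangles into one SOA Tri: lane i holds triangle i.
// Assumption: e1 = v1 - v0 and e2 = v2 - v0.
AVX2_FN void packTriangles(const ScalarTri in[8], Tri& out)
{
    for (int axis = 0; axis < 3; ++axis) {
        float e1[8], e2[8], v0[8];
        for (int i = 0; i < 8; ++i) {
            v0[i] = in[i].v0[axis];
            e1[i] = in[i].v1[axis] - in[i].v0[axis];
            e2[i] = in[i].v2[axis] - in[i].v0[axis];
        }
        out.e1[axis] = _mm256_loadu_ps(e1);
        out.e2[axis] = _mm256_loadu_ps(e2);
        out.v0[axis] = _mm256_loadu_ps(v0);
    }
}

// 1 ray against 8 triangles: copy the same ray into every lane.
AVX2_FN void broadcastRay(const ScalarRay& in, Ray& out)
{
    for (int axis = 0; axis < 3; ++axis) {
        out.dir[axis] = _mm256_set1_ps(in.dir[axis]);
        out.pos[axis] = _mm256_set1_ps(in.pos[axis]);
    }
}

// Portable lane read: store to memory instead of poking at __m256 internals.
AVX2_FN float lane(const __m256& v, int i)
{
    assert(i >= 0 && i < 8);
    float tmp[8];
    _mm256_storeu_ps(tmp, v);
    return tmp[i];
}
```

In a real tracer you would do the packing once at build time and keep the triangles in SOA form, rather than transposing per ray.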
The maths code then starts to look far nicer (be aware that _mm_dp_ps is not the fastest instruction ever written, which is one more reason to prefer SOA, where a dot product is just vertical multiplies and adds. Also be aware that accessing the internal implementation of the __m128/__m256/__m512 types is not portable).
#include <immintrin.h> // AVX / AVX2 / FMA intrinsics
#define or8f _mm256_or_ps
#define mul _mm256_mul_ps
#define fmsub _mm256_fmsub_ps
#define fmadd _mm256_fmadd_ps
void cross(__m256 result[3], const __m256 a[3], const __m256 b[3])
{
    result[0] = fmsub(a[1], b[2], mul(b[1], a[2]));
    result[1] = fmsub(a[2], b[0], mul(b[2], a[0]));
    result[2] = fmsub(a[0], b[1], mul(b[0], a[1]));
}
__m256 dot(const __m256 a[3], const __m256 b[3])
{
    return fmadd(a[2], b[2], fmadd(a[1], b[1], mul(a[0], b[0])));
}
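For reference, here is a rough sketch of how a, u, v and t could be computed for 8 triangles at once with helpers like these. It assumes your scalar routine is the usual Möller–Trumbore formulation (I am guessing from the variable names), and `Terms`, `computeTerms`, `splat3`, `lane` and `AVX2_FN` are names invented for this example. Note there is no branch anywhere: every lane is computed, valid or not.

```cpp
#include <immintrin.h>
#include <cassert>

#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2,fma")))
#else
#define AVX2_FN
#endif

struct Tri { __m256 e1[3]; __m256 e2[3]; __m256 v0[3]; };
struct Ray { __m256 dir[3]; __m256 pos[3]; };
struct Terms { __m256 a, u, v, t; };  // hypothetical: one value per lane

AVX2_FN void cross(__m256 r[3], const __m256 a[3], const __m256 b[3])
{
    r[0] = _mm256_fmsub_ps(a[1], b[2], _mm256_mul_ps(b[1], a[2]));
    r[1] = _mm256_fmsub_ps(a[2], b[0], _mm256_mul_ps(b[2], a[0]));
    r[2] = _mm256_fmsub_ps(a[0], b[1], _mm256_mul_ps(b[0], a[1]));
}

AVX2_FN __m256 dot(const __m256 a[3], const __m256 b[3])
{
    return _mm256_fmadd_ps(a[2], b[2],
           _mm256_fmadd_ps(a[1], b[1], _mm256_mul_ps(a[0], b[0])));
}

// Assumed Moller-Trumbore terms, computed for all 8 lanes with no early out.
AVX2_FN void computeTerms(const Tri& tri, const Ray& ray, Terms& out)
{
    __m256 h[3], s[3], q[3];
    cross(h, ray.dir, tri.e2);                       // h = dir x e2
    out.a = dot(tri.e1, h);                          // determinant
    const __m256 f = _mm256_div_ps(_mm256_set1_ps(1.0f), out.a);
    for (int i = 0; i < 3; ++i)
        s[i] = _mm256_sub_ps(ray.pos[i], tri.v0[i]); // s = pos - v0
    out.u = _mm256_mul_ps(f, dot(s, h));
    cross(q, s, tri.e1);                             // q = s x e1
    out.v = _mm256_mul_ps(f, dot(ray.dir, q));
    out.t = _mm256_mul_ps(f, dot(tri.e2, q));
}

// Helpers for trying it out: broadcast a vector, read one lane back.
AVX2_FN void splat3(__m256 out[3], float x, float y, float z)
{
    out[0] = _mm256_set1_ps(x);
    out[1] = _mm256_set1_ps(y);
    out[2] = _mm256_set1_ps(z);
}

AVX2_FN float lane(const __m256& v, int i)
{
    assert(i >= 0 && i < 8);
    float tmp[8];
    _mm256_storeu_ps(tmp, v);
    return tmp[i];
}
```

Lanes where a is zero will produce infinities or NaNs in u, v and t. That is fine as long as the a test is part of the failure mask, because those lanes get thrown away at the end anyway.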
You basically have 4 conditions in the method:
if (a > negativeEpsilon && a < positiveEpsilon)
if (u < 0.0f)
if (v < 0.0f || (u + v > 1.0f))
if (t < 0.0f || t > m_length)
If any of those conditions is true, then there is no intersection. That basically requires a little refactoring (in pseudo code):
__m256 condition0 = (a > negativeEpsilon && a < positiveEpsilon);
__m256 condition1 = (u < 0.0f);
__m256 condition2 = (v < 0.0f || (u + v > 1.0f));
__m256 condition3 = (t < 0.0f || t > m_length);
// combine all conditions that can cause failure.
__m256 failed = or8f(or8f(condition0, condition1), or8f(condition2, condition3));
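In real intrinsics, the pseudo code above might look roughly like this. `failureMask`, `failedLanes`, the `epsilon` parameter and the `AVX2_FN` macro are my own names for the sketch, and I am assuming negativeEpsilon is simply -positiveEpsilon. Each comparison yields all-ones in a lane where it is true and all-zeros otherwise, so && and || become bitwise and/or.

```cpp
#include <immintrin.h>
#include <cassert>

#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2,fma")))
#else
#define AVX2_FN
#endif

// Builds the per-lane "no intersection" mask from a, u, v and t.
AVX2_FN void failureMask(const __m256& a, const __m256& u,
                         const __m256& v, const __m256& t,
                         float epsilon, float length, __m256& failed)
{
    assert(epsilon > 0.0f && length >= 0.0f);
    const __m256 zero   = _mm256_setzero_ps();
    const __m256 one    = _mm256_set1_ps(1.0f);
    const __m256 posEps = _mm256_set1_ps(epsilon);
    const __m256 negEps = _mm256_set1_ps(-epsilon);
    const __m256 len    = _mm256_set1_ps(length);

    // a > negativeEpsilon && a < positiveEpsilon
    const __m256 c0 = _mm256_and_ps(_mm256_cmp_ps(a, negEps, _CMP_GT_OQ),
                                    _mm256_cmp_ps(a, posEps, _CMP_LT_OQ));
    // u < 0
    const __m256 c1 = _mm256_cmp_ps(u, zero, _CMP_LT_OQ);
    // v < 0 || u + v > 1
    const __m256 c2 = _mm256_or_ps(
        _mm256_cmp_ps(v, zero, _CMP_LT_OQ),
        _mm256_cmp_ps(_mm256_add_ps(u, v), one, _CMP_GT_OQ));
    // t < 0 || t > length
    const __m256 c3 = _mm256_or_ps(_mm256_cmp_ps(t, zero, _CMP_LT_OQ),
                                   _mm256_cmp_ps(t, len, _CMP_GT_OQ));

    failed = _mm256_or_ps(_mm256_or_ps(c0, c1), _mm256_or_ps(c2, c3));
}

// Convenience wrapper for experimenting: bit i of the result is set
// when lane i failed.
AVX2_FN int failedLanes(const float a[8], const float u[8],
                        const float v[8], const float t[8],
                        float epsilon, float length)
{
    __m256 failed;
    failureMask(_mm256_loadu_ps(a), _mm256_loadu_ps(u),
                _mm256_loadu_ps(v), _mm256_loadu_ps(t),
                epsilon, length, failed);
    return _mm256_movemask_ps(failed);
}
```

The ordered (`_OQ`) predicates return false when either operand is NaN, so it is worth checking how NaN lanes should be treated in your own code before relying on this.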
So finally, if an intersection occurred, the result will be t. If an intersection DID NOT occur, then we need to set the result to a sentinel that can never be a valid t (a negative number is a good choice here, since valid hits always have t >= 0):
// if(failed) return -1;
// else return t;
return _mm256_blendv_ps(t, _mm256_set1_ps(-1.0f), failed);
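On the calling side you then have 8 values, each either a valid t or -1. As a sketch of the "filter at the end" step, this is one way to pick the nearest hit out of them. `nearestHit`, `load8` and `AVX2_FN` are hypothetical names for this example, and the scalar loop is just the simplest thing that works, not the fastest.

```cpp
#include <immintrin.h>
#include <cassert>

#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2,fma")))
#else
#define AVX2_FN
#endif

// Returns the lane index of the closest hit, or -1 if every lane missed.
// `result` holds t for hits and a negative sentinel for misses.
AVX2_FN int nearestHit(const __m256& result, float& nearestT)
{
    // Bit i is set when lane i holds a valid (non-negative) t.
    const __m256 hit = _mm256_cmp_ps(result, _mm256_setzero_ps(), _CMP_GE_OQ);
    const int bits = _mm256_movemask_ps(hit);

    float t[8];
    _mm256_storeu_ps(t, result);

    int best = -1;
    for (int i = 0; i < 8; ++i) {
        const bool isHit = ((bits >> i) & 1) != 0;
        if (isHit && (best < 0 || t[i] < t[best]))
            best = i;
    }
    // Sanity check: a selected lane is never a miss.
    assert(best < 0 || t[best] >= 0.0f);
    if (best >= 0)
        nearestT = t[best];
    return best;
}

// Helper for trying it out with plain floats.
AVX2_FN void load8(const float in[8], __m256& out)
{
    out = _mm256_loadu_ps(in);
}
```

When testing 1 ray against 8 triangles the lane index tells you which triangle was hit. When testing 8 rays against 1 triangle you would keep all 8 results instead of reducing them.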
Whilst the final code may look a bit nasty, it will end up being significantly faster than your approach. The devil is in the details though....
One major problem with this approach is that you have a choice between testing 1 ray against 8 triangles, or testing 8 rays against 1 triangle. For primary rays this probably isn't a big deal. For secondary rays (which have a habit of scattering in different directions), things can start to get a bit annoying. There's a good chance most of the ray tracing code will end up following a pattern of: test -> sort -> batch -> test -> sort -> batch
If you don't follow that pattern, you're pretty much never going to get the most out of the vector units. (Thankfully the compress/expand instructions in AVX512 help out with this quite a lot!)
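As a very rough sketch of the test -> batch part of that loop on AVX2 (which has no compress instruction), you can turn the hit mask into bits and append the surviving ray ids to a queue, then pull full groups of 8 back off it. `appendSurvivors`, `nextBatch`, `load8` and `AVX2_FN` are names I made up, the sort step is left out entirely, and the queue is just a `std::vector`. On AVX512 I believe the inner loop could be replaced with something like `_mm512_mask_compressstoreu_epi32`, but I have not shown that here.

```cpp
#include <immintrin.h>
#include <cassert>
#include <vector>

#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2,fma")))
#else
#define AVX2_FN
#endif

// Appends the ids of the rays whose lane holds a valid (t >= 0) result.
AVX2_FN void appendSurvivors(const __m256& t, const int rayIds[8],
                             std::vector<int>& queue)
{
    const __m256 hit = _mm256_cmp_ps(t, _mm256_setzero_ps(), _CMP_GE_OQ);
    const int bits = _mm256_movemask_ps(hit);
    for (int i = 0; i < 8; ++i) {
        if ((bits >> i) & 1)
            queue.push_back(rayIds[i]);
    }
}

// Removes the last eight ids from the queue to form the next full batch.
void nextBatch(std::vector<int>& queue, int batch[8])
{
    assert(queue.size() >= 8);
    const std::size_t start = queue.size() - 8;
    for (int i = 0; i < 8; ++i)
        batch[i] = queue[start + i];
    queue.resize(start);
}

// Helper for trying it out with plain floats.
AVX2_FN void load8(const float in[8], __m256& out)
{
    out = _mm256_loadu_ps(in);
}
```

The point is that the next test always runs on a full batch of 8 live rays, instead of a batch where most of the lanes are already dead.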