10 votes

I am trying to convert the following code to SSE/AVX:

float x1, x2, x3;
float a1[N], a2[N], a3[N], b1[N], b2[N], b3[N];
for (int i = 0; i < N; i++)
{
    if (x1 > a1[i] && x2 > a2[i] && x3 > a3[i] && x1 < b1[i] && x2 < b2[i] && x3 < b3[i])
    {
        // do something with i
    }
}

Here N is a small constant, let's say 8. The if(...) statement evaluates to false most of the time.

First attempt:

__m128 x; // x1, x2, x3, 0
__m128 a[N]; // packed a1[i], a2[i], a3[i], 0 
__m128 b[N]; // packed b1[i], b2[i], b3[i], 0

for (int i = 0; i < N; i++)
{
    __m128 gt_mask = _mm_cmpgt_ps(x, a[i]);
    __m128 lt_mask = _mm_cmplt_ps(x, b[i]);
    __m128 mask = _mm_and_ps(gt_mask, lt_mask);
    if (_mm_movemask_epi8(_mm_castps_si128(mask)) == 0xfff0)
    {
        // do something with i
    }
}

This works, and is reasonably fast. The question is: is there a more efficient way of doing this? In particular, if there is a register holding results from SSE or AVX comparisons on floats (which put 0xffffffff or 0x00000000 in each slot), how can the results of all the comparisons be (for example) and-ed or or-ed together, in general? Is PMOVMSKB (or the corresponding _mm_movemask intrinsic) the standard way to do this?

Also, how can AVX 256-bit registers be used instead of SSE in the code above?

EDIT:

Tested and benchmarked a version using VPTEST (via the _mm_test* intrinsics), as suggested below.

__m128 x; // x1, x2, x3, 0
__m128 a[N]; // packed a1[i], a2[i], a3[i], 0
__m128 b[N]; // packed b1[i], b2[i], b3[i], 0
__m128i ref_mask = _mm_set_epi32(0xffff, 0xffff, 0xffff, 0x0000); // testc only requires the bits set here; low (padding) element left zero

for (int i = 0; i < N; i++)
{
    __m128 gt_mask = _mm_cmpgt_ps(x, a[i]);
    __m128 lt_mask = _mm_cmplt_ps(x, b[i]);
    __m128 mask = _mm_and_ps(gt_mask, lt_mask);
    if (_mm_testc_si128(_mm_castps_si128(mask), ref_mask))
    {
        // do stuff with i
    }
}

This also works, and is fast. Benchmarking (Intel i7-2630QM, Windows 7, Cygwin 1.7, Cygwin gcc 4.5.3 or mingw x86_64 gcc 4.5.3, N=8) shows it to be identical in speed to the code above (within 0.1%) on 64-bit. Either version of the inner loop runs in about 6.8 clocks on average on data which is all in cache and for which the comparison always returns false.

Interestingly, on 32-bit, the _mm_test version runs about 10% slower. It turns out that the compiler spills the masks after loop unrolling and has to re-read them; this is probably unnecessary and could be avoided in hand-coded assembly.

Which method to choose? It seems there is no compelling reason to prefer VPTEST over VMOVMSKPS. Actually, there is a slight reason to prefer VMOVMSKPS, namely that it frees up an xmm register which would otherwise be taken up by the mask.


2 Answers

12 votes

If you're working with floats, you generally want to use MOVMSKPS (and the corresponding AVX instruction VMOVMSKPS) instead of PMOVMSKB.
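
For example, the test in your first loop could look like this (a minimal sketch, assuming the same x1, x2, x3, 0 packing as in your code, where the low lane always compares false):

__m128 gt_mask = _mm_cmpgt_ps(x, a[i]);
__m128 lt_mask = _mm_cmplt_ps(x, b[i]);
__m128 mask = _mm_and_ps(gt_mask, lt_mask);
if (_mm_movemask_ps(mask) == 0xE) // one bit per float lane: 0b1110, low padding lane stays 0
{
    // do something with i
}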

That aside, yes, this is one standard way of doing this; you can also use PTEST (VPTEST) to directly update the condition flags based on the result of an SSE or AVX AND or ANDNOT.
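
As for the AVX part of your question, one possibility (an untested sketch; the two-candidates-per-vector layout and the names xx, aa, bb are my assumptions, not anything from your code) is to pack the bounds for two values of i into each 256-bit vector and use VCMPPS plus VMOVMSKPS:

// requires AVX and <immintrin.h>
__m256 xx;        // x1, x2, x3, 0, x1, x2, x3, 0 (x duplicated into both halves)
__m256 aa[N/2];   // bounds for candidates 2*i and 2*i+1, packed like your a[]
__m256 bb[N/2];

for (int i = 0; i < N/2; i++)
{
    __m256 gt_mask = _mm256_cmp_ps(xx, aa[i], _CMP_GT_OQ);
    __m256 lt_mask = _mm256_cmp_ps(xx, bb[i], _CMP_LT_OQ);
    int m = _mm256_movemask_ps(_mm256_and_ps(gt_mask, lt_mask)); // 8 bits, one per lane
    if ((m & 0x0E) == 0x0E) { /* do something with 2*i     */ }
    if ((m & 0xE0) == 0xE0) { /* do something with 2*i + 1 */ }
}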

2 votes

To address your edited version:

If you're going to branch directly on the result of PTEST, it's faster to use it than to MOVMSKPS into a general-purpose register and then do a TEST on that to set the flags for a branch instruction. On AMD CPUs, moving data between the vector and integer domains is very slow (5 to 10 cycles of latency, depending on the CPU model).

As far as needing an extra register for PTEST, you often don't. You can use the same value as both args, like with the regular non-vector TEST instruction. (Testing foo & foo is the same as testing foo).
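
For instance, to branch on some mask vector m being entirely zero, without any extra constant (requires SSE4.1):

if (_mm_testz_si128(m, m)) // PTEST m, m: ZF=1 iff m is all zeros
{
    // every lane of m was zero
}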

In your case, you do need to check that all the relevant vector elements are set. If you reversed the comparisons and then ORed the results together (so you're testing !(x1 > a1[i]) || !(x2 > a2[i]) || ...), you'd have vectors you needed to test for all-zero rather than for all-ones. But dealing with the low element is still problematic: under the reversed comparisons the 0 padding lane compares true. If you needed to save a register to avoid needing a vector mask for PTEST / VTESTPS, you could right-shift the vector by 4 bytes before doing a PTEST and branching on it being all-zero.
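
A rough sketch of that idea, assuming the same packing as in your question (the names are illustrative, and NaN handling differs slightly from the original && version):

__m128 fail_lo = _mm_cmple_ps(x, a[i]);    // !(x > a[i]) per lane
__m128 fail_hi = _mm_cmpge_ps(x, b[i]);    // !(x < b[i]) per lane
__m128 fail = _mm_or_ps(fail_lo, fail_hi); // any set lane => reject
// the low padding lane compares true, so shift it out before testing for all-zero
__m128i f = _mm_srli_si128(_mm_castps_si128(fail), 4);
if (_mm_testz_si128(f, f))
{
    // do something with i
}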

AVX introduced VTESTPS, which I guess avoids the possible float -> int bypass delay. If you used any int-domain instructions to generate inputs for a test, though, you might as well use (V)PTEST. (I know you were using intrinsics, but they're a pain to type and look at compared to mnemonics.)
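
For example (a hypothetical use; the reference constant here is my choice), VTESTPS only looks at sign bits, so the low padding lane can simply be excluded from the reference instead of shifted out:

// requires AVX and <immintrin.h>
__m128 ref = _mm_set_ps(-0.0f, -0.0f, -0.0f, 0.0f); // sign bits set for the x1, x2, x3 lanes only
if (_mm_testc_ps(mask, ref)) // VTESTPS: CF=1 iff mask has the sign bit set in all three lanes
{
    // do something with i
}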