Optimizing horizontal boolean reduction in ARM NEON

Question

I'm experimenting with a cross-platform SIMD library à la ecmascript_simd (a.k.a. SIMD.js), and part of this is providing a few "horizontal" SIMD operations. In particular, the API that the library offers includes any(<boolN x M>) -> bool and all(<boolN x M>) -> bool functions, where <T x K> is a vector of K elements of type T and boolN is an N-bit boolean, i.e. all ones or all zeros, as SSE and NEON return for their comparison operations.

For example, let v be a <bool32 x 4> (a 128-bit vector); it could be the result of VCLT.S32 or something similar. I'd like to compute all(v) = v[0] && v[1] && v[2] && v[3] and any(v) = v[0] || v[1] || v[2] || v[3].

This is easy with SSE, e.g. movmskps will extract the high bit of each element, so all for the type above becomes (with C intrinsics):

#include<xmmintrin.h>
int all(__m128 x) {
    return _mm_movemask_ps(x) == 8 + 4 + 2 + 1;
}

and similarly for any.

I'm struggling to find obvious/nice/efficient ways to implement this with NEON, which doesn't support an instruction like movmskps. One approach is to simply extract each element and compute with scalars (the naive method below). Another is to use the "horizontal" operations NEON supports, like VPMAX and VPMIN.

#include<arm_neon.h>

int all_naive(uint32x4_t v) {
    return v[0] && v[1] && v[2] && v[3];
}
int all_horiz(uint32x4_t v) {
    uint32x2_t x = vpmin_u32(vget_low_u32(v),
                             vget_high_u32(v));
    uint32x2_t y = vpmin_u32(x, x);
    return y[0] != 0;  /* y, not x: lane 0 of y is the min over all four lanes */
}

(One can do a similar thing for the latter with VPADD, which may be faster, but it's fundamentally the same idea.)
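For comparison, here is a rough sketch of the any counterpart built on VPMAX, together with a VPADD-based all. The names any_horiz and all_padd and the array-based signatures are illustrative assumptions, and the scalar #else branch exists only so the snippet also builds where NEON is unavailable. Both functions assume every lane is exactly 0 or 0xFFFFFFFF.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* any(): two pairwise-max steps fold the four lanes into lane 0. */
int any_horiz(const uint32_t lanes[4]) {
    uint32x4_t v = vld1q_u32(lanes);
    uint32x2_t x = vpmax_u32(vget_low_u32(v), vget_high_u32(v));
    uint32x2_t y = vpmax_u32(x, x);
    return vget_lane_u32(y, 0) != 0;
}

/* all() via VPADD: four all-ones lanes sum to 0xFFFFFFFC (mod 2^32),
   and no smaller number of true lanes gives that sum. */
int all_padd(const uint32_t lanes[4]) {
    uint32x4_t v = vld1q_u32(lanes);
    uint32x2_t x = vpadd_u32(vget_low_u32(v), vget_high_u32(v));
    uint32x2_t y = vpadd_u32(x, x);
    return vget_lane_u32(y, 0) == 0xFFFFFFFCu;
}
#else
/* Scalar stand-ins (assumption: only used where NEON is unavailable). */
int any_horiz(const uint32_t lanes[4]) {
    return (lanes[0] | lanes[1] | lanes[2] | lanes[3]) != 0;
}
int all_padd(const uint32_t lanes[4]) {
    return (uint32_t)(lanes[0] + lanes[1] + lanes[2] + lanes[3]) == 0xFFFFFFFCu;
}
#endif
```

Whether VPADD actually beats VPMIN depends on the core, so this is something to measure rather than assume.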

Are there other tricks one can use to implement this?

Yes, I know that horizontal operations are not great with SIMD vector units. But sometimes they are useful, e.g. many SIMD implementations of Mandelbrot will operate on 4 points at once and bail out of the inner loop when all of them are out of range, which requires doing a comparison and then a horizontal and.

A more interesting SSE instruction than movmskps is ptest. You can use it for both and and or. I think NEON has a similar instruction, vtst. I have not implemented this yet, but I think you can find your answer here: fastest-way-to-test-a-128-bit-neon-register-for-a-value-of-0-using-intrinsics. — Z boson
@Zboson: vtst turns out to not be especially useful here, sadly (because you already have a vector of 0/-1 values from the compare). Nils' suggestion from the linked answer (saturated add + read Q bit) doesn't work out nicely in general because the Q bit is sticky so you need to clear it first with RMW. So the usual approach is multiple vpmax/vpmin on arm32 and a single umaxv/uminv on arm64. — Stephen Canon
I was not aware that many "SIMD implementations of Mandelbrot will operate on 4 points at once, and bail out of the inner loop when all of them are out of range..." I have been doing this for a while myself (actually 8 pixels with AVX for single-precision floats). For x86 I use ptest, but it appears you have already found the best solution for ARM: namely min/max twice on ARMv7 and once on ARMv8. — Z boson
@StephenCanon, in that case maybe you can provide an answer to fastest-way-to-test-a-128-bit-neon-register-for-a-value-of-0-using-intrinsics. — Z boson
Related: NEON pack vector compare result into bitmap asks for a movmskps equivalent. Might not be the right building block for things like testing if any element was true, though. (e.g. only packing down to 4 bytes instead of 4 bits may be easier, and testing a 32-bit integer for 0 or -1) — Peter Cordes

Denis Yaroshevskiy · Accepted Answer · 2021-02-15T21:32:24

This is my current solution, as implemented in the eve library.

If your backend has C++20 support, you can just use the library: it has implementations for arm-v7, arm-v8 (only little endian at the moment) and all x86 from sse2 to avx-512. It's open source and MIT licensed. In beta at the moment. Feel free to reach out (for example with an issue) if you are trying out the library.

Take everything with a grain of salt - I don't yet have the arm benchmarks set up.

NOTE: On top of basic all and any we also have a movemask equivalent to do more complex operations like first_true. That wasn't part of the question and it's not amazing but the code can be found here

ARM-v7, 8-byte register

Now, arm-v7 is a 32-bit architecture, so we try to get to 32-bit elements where we can.

any

Use pairwise 32 bit max. If any element is true, the max is true.

// cast to dwords
dwords = vpmax_u32(dwords, dwords);
return vget_lane_u32(dwords, 0);

all

Pairwise min instead of max. What you test against also changes. If you have 4-byte (or larger) elements, just test for true (non-zero). If you have shorts or chars, one dword holds several elements, so you need to test for -1 (all bits set).

// cast to dwords
dwords = vpmin_u32(dwords, dwords);
std::uint32_t combined = vget_lane_u32(dwords, 0);

// Assuming T is your scalar type
if constexpr ( sizeof(T) >= 4 ) return combined;

// !~combined is true only when combined is all ones. I decided that !~ is
// better than comparing with -1; the compiler will figure it out.
return !~combined;
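Put together, the two fragments above might look roughly like the sketch below. The function names any8 and all8 and the byte-array signatures are assumptions made for illustration (they are not eve's API), and the scalar #else branch is only a stand-in so the snippet also builds without NEON. Comparing against all ones in all8 is a simplification that covers every element width at once.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* any(): view the 8 bytes as two dwords and take their pairwise max. */
int any8(const uint8_t mask[8]) {
    uint32x2_t dwords = vreinterpret_u32_u8(vld1_u8(mask));
    dwords = vpmax_u32(dwords, dwords);
    return vget_lane_u32(dwords, 0) != 0;
}

/* all(): pairwise min, then require all 32 bits to be set, so that
   8- and 16-bit elements sharing one dword are all checked. */
int all8(const uint8_t mask[8]) {
    uint32x2_t dwords = vreinterpret_u32_u8(vld1_u8(mask));
    dwords = vpmin_u32(dwords, dwords);
    return vget_lane_u32(dwords, 0) == 0xFFFFFFFFu;
}
#else
/* Scalar stand-ins (assumption: only used where NEON is unavailable). */
int any8(const uint8_t mask[8]) {
    int acc = 0;
    for (int i = 0; i < 8; ++i) acc |= mask[i];
    return acc != 0;
}
int all8(const uint8_t mask[8]) {
    int acc = 0xFF;
    for (int i = 0; i < 8; ++i) acc &= mask[i];
    return acc == 0xFF;
}
#endif
```

Because the result only depends on whether the bytes are all zero or all ones, the byte order of the reinterpretation does not matter here.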

ARM-v7, 16-byte register

For anything bigger than chars, just narrow the vector to a 64-bit one. Here is the list of vector narrow integer conversions.

For chars, the best I found is to reinterpret as uint32 and do an extra check: compare for == -1 for all and > 0 for any. That seemed to give nicer asm than splitting into two 8-byte registers.

Then just do all/any on that dword register.
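One possible reading of this recipe is sketched below. The helper and function names (all_u16x4, any_u16x4, all16_u32, all16_u8, any16_u8) are hypothetical, vmovn_u32 is just one of the available narrowing conversions, and the scalar #else branch is only a stand-in for builds without NEON. The 32-bit case assumes every lane is exactly 0 or 0xFFFFFFFF.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Shared tail: all()/any() on a 64-bit register holding 16-bit mask elements. */
static int all_u16x4(uint16x4_t m) {
    uint32x2_t d = vreinterpret_u32_u16(m);
    d = vpmin_u32(d, d);
    return vget_lane_u32(d, 0) == 0xFFFFFFFFu;
}
static int any_u16x4(uint16x4_t m) {
    uint32x2_t d = vreinterpret_u32_u16(m);
    d = vpmax_u32(d, d);
    return vget_lane_u32(d, 0) != 0;
}

/* 32-bit elements: narrowing keeps the low half of each lane,
   which is enough for a 0/-1 mask. */
int all16_u32(const uint32_t lanes[4]) {
    return all_u16x4(vmovn_u32(vld1q_u32(lanes)));
}

/* 8-bit elements: compare whole dwords first (the "extra check"),
   then narrow the resulting 32-bit mask. */
int all16_u8(const uint8_t bytes[16]) {
    uint32x4_t d = vreinterpretq_u32_u8(vld1q_u8(bytes));
    return all_u16x4(vmovn_u32(vceqq_u32(d, vdupq_n_u32(0xFFFFFFFFu))));
}
int any16_u8(const uint8_t bytes[16]) {
    uint32x4_t d = vreinterpretq_u32_u8(vld1q_u8(bytes));
    return any_u16x4(vmovn_u32(vcgtq_u32(d, vdupq_n_u32(0))));
}
#else
/* Scalar stand-ins (assumption: only used where NEON is unavailable). */
int all16_u32(const uint32_t lanes[4]) {
    return lanes[0] && lanes[1] && lanes[2] && lanes[3];
}
int all16_u8(const uint8_t bytes[16]) {
    int acc = 0xFF;
    for (int i = 0; i < 16; ++i) acc &= bytes[i];
    return acc == 0xFF;
}
int any16_u8(const uint8_t bytes[16]) {
    int acc = 0;
    for (int i = 0; i < 16; ++i) acc |= bytes[i];
    return acc != 0;
}
#endif
```

The compare step matters for chars because a plain narrow would drop the upper bytes of each dword, so a single true byte in the high half would be lost for any.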

ARM-v8, 8-byte register

ARM-v8 has 64 bit support, so you can just get a 64 bit lane. That one is trivially testable.
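As a minimal sketch of that idea (the names mask_bits, any8_a64 and all8_a64 are made up for illustration, and the memcpy branch is only a stand-in for builds without NEON), the whole register can be read as one 64-bit lane and compared against 0 or all ones:

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Read the whole 8-byte register as a single 64-bit lane. */
static uint64_t mask_bits(const uint8_t mask[8]) {
    return vget_lane_u64(vreinterpret_u64_u8(vld1_u8(mask)), 0);
}
#else
#include <string.h>

/* Scalar stand-in (assumption: only used where NEON is unavailable). */
static uint64_t mask_bits(const uint8_t mask[8]) {
    uint64_t bits;
    memcpy(&bits, mask, sizeof bits);
    return bits;
}
#endif

/* any(): at least one bit set; all(): every bit set. */
int any8_a64(const uint8_t mask[8]) { return mask_bits(mask) != 0; }
int all8_a64(const uint8_t mask[8]) { return mask_bits(mask) == UINT64_MAX; }
```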

ARM-v8, 16-byte register

For any we use vmaxvq_u32 (there is no 64-bit variant). For all we use vminvq_u32, vminvq_u16 or vminvq_u8, depending on the element size. (This is similar to what glibc's strlen does.)
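A hedged sketch of the across-vector reductions follows. The function names are hypothetical, only the 8-bit and 32-bit variants of all are shown, and the scalar #else branch is a stand-in for targets other than AArch64. The all functions assume each element is either zero or all ones.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* any(): a single across-vector max over the four dwords. */
int any16_a64(const uint8_t bytes[16]) {
    return vmaxvq_u32(vreinterpretq_u32_u8(vld1q_u8(bytes))) != 0;
}

/* all(): across-vector min at the element's own width;
   the min is non-zero only if every element is true. */
int all16_u8_a64(const uint8_t bytes[16]) {
    return vminvq_u8(vld1q_u8(bytes)) != 0;
}
int all16_u32_a64(const uint32_t lanes[4]) {
    return vminvq_u32(vld1q_u32(lanes)) != 0;
}
#else
/* Scalar stand-ins (assumption: only used off AArch64). */
int any16_a64(const uint8_t bytes[16]) {
    int acc = 0;
    for (int i = 0; i < 16; ++i) acc |= bytes[i];
    return acc != 0;
}
int all16_u8_a64(const uint8_t bytes[16]) {
    for (int i = 0; i < 16; ++i)
        if (bytes[i] == 0) return 0;
    return 1;
}
int all16_u32_a64(const uint32_t lanes[4]) {
    return lanes[0] && lanes[1] && lanes[2] && lanes[3];
}
#endif
```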

Conclusion

The lack of benchmarks definitely worries me: some instructions are sometimes problematic, and I would not know about it. Regardless, that's the best I've got, so far at least.
