I'm experimenting with a cross-platform SIMD library à la ecmascript_simd (a.k.a. SIMD.js), and part of this is providing a few "horizontal" SIMD operations. In particular, the API that library offers includes any(<boolN x M>) -> bool and all(<boolN x M>) -> bool functions, where <T x K> is a vector of K elements of type T and boolN is an N-bit boolean, i.e. all ones or all zeros, as SSE and NEON return for their comparison operations.
For example, let v be a <bool32 x 4> (a 128-bit vector); it could be the result of VCLT.S32 or something similar. I'd like to compute all(v) = v[0] && v[1] && v[2] && v[3] and any(v) = v[0] || v[1] || v[2] || v[3].
This is easy with SSE, e.g. movmskps will extract the high bit of each element, so all for the type above becomes (with C intrinsics):
#include <xmmintrin.h>
int all(__m128 x) {
    return _mm_movemask_ps(x) == 8 + 4 + 2 + 1;
}
and similarly for any.
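For completeness, a sketch of the any counterpart under the same assumption (every 32-bit lane is all ones or all zeros). It only checks that at least one of the four sign bits is set:

```c
#include <xmmintrin.h>

/* Assumes each 32-bit lane of x is all ones or all zeros,
   as produced by the SSE comparison instructions. */
int any(__m128 x) {
    /* movmskps packs the sign bit of each lane into bits 0..3 */
    return _mm_movemask_ps(x) != 0;
}
```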
I'm struggling to find obvious/nice/efficient ways to implement this with NEON, which doesn't have an instruction like movmskps. There's the naive approach of simply extracting each element and computing with scalars, and there's also the approach of using the "horizontal" operations NEON supports, like VPMAX and VPMIN:
#include <arm_neon.h>
int all_naive(uint32x4_t v) {
    return v[0] && v[1] && v[2] && v[3];
}
int all_horiz(uint32x4_t v) {
    /* pairwise min of (v0,v1) and (v2,v3) */
    uint32x2_t x = vpmin_u32(vget_low_u32(v),
                             vget_high_u32(v));
    /* second pairwise min leaves min(v0..v3) in lane 0 */
    uint32x2_t y = vpmin_u32(x, x);
    return y[0] != 0;
}
(One can do a similar thing for the latter with VPADD, which may be faster, but it's fundamentally the same idea.)
Are there other tricks one can use to implement this?
Yes, I know that horizontal operations are not great with SIMD vector units. But sometimes they are useful, e.g. many SIMD implementations of Mandelbrot will operate on 4 points at once, and bail out of the inner loop when all of them are out of range... which requires doing a comparison and then a horizontal AND.
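To make that use case concrete, here is a minimal SSE sketch of such a bail-out loop. The function name mandel4_iters and the sticky done mask are my own choices for illustration, not taken from any particular implementation; the mask is kept sticky because lanes that have already escaped may later overflow to NaN and would otherwise compare as "not escaped" again.

```c
#include <xmmintrin.h>

/* Hypothetical helper: iterate z = z*z + c for four points at once.
   Returns the iteration index at which all four points had escaped,
   or max_iter if at least one never did. */
int mandel4_iters(__m128 cr, __m128 ci, int max_iter) {
    const __m128 four = _mm_set1_ps(4.0f);
    __m128 zr = _mm_setzero_ps();
    __m128 zi = _mm_setzero_ps();
    __m128 done = _mm_setzero_ps();   /* sticky per-lane "escaped" mask */
    int i;
    for (i = 0; i < max_iter; ++i) {
        __m128 zr2 = _mm_mul_ps(zr, zr);
        __m128 zi2 = _mm_mul_ps(zi, zi);
        /* comparison: lane is all ones where |z|^2 > 4 */
        __m128 escaped = _mm_cmpgt_ps(_mm_add_ps(zr2, zi2), four);
        done = _mm_or_ps(done, escaped);
        /* horizontal AND: all four sign bits set */
        if (_mm_movemask_ps(done) == 0xF)
            break;
        __m128 zrzi = _mm_mul_ps(zr, zi);
        zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);
        zi = _mm_add_ps(_mm_add_ps(zrzi, zrzi), ci);
    }
    return i;
}
```

On NEON the movemask test is the part that would need replacing with one of the reductions discussed here.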
movemskps is ptest. You can use this for "and" or "or". I think Neon has the same instruction vtest. I have not implemented this yet but I think you can find your answer here fastest-way-to-test-a-128-bit-neon-register-for-a-value-of-0-using-intrinsics. - Z boson
vtst turns out to not be especially useful here, sadly (because you already have a vector of 0/-1 values from the compare). Nils' suggestion from the linked answer (saturated add + read Q bit) doesn't work out nicely in general because the Q bit is sticky so you need to clear it first with RMW. So the usual approach is multiple vpmax/vpmin on arm32 and a single umaxv/uminv on arm64. - Stephen Canon
ptest but it appears you have already found the best solution with ARM: namely min/max twice with arm7 and once with arm8. - Z boson
movmskps equivalent. Might not be the right building block for things like testing if any element was true, though. (e.g. only packing down to 4 bytes instead of 4 bits may be easier, and testing a 32-bit integer for 0 or -1) - Peter Cordes
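The two ideas from the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: every lane of v is 0 or 0xFFFFFFFF, the function names (pack_mask_bytes, all_packed, any_packed, all_a64, any_a64) are made up for this example, and the byte-packing variant is my reading of the "pack down to 4 bytes" suggestion rather than code anyone posted. The vminvq_u32/vmaxvq_u32 intrinsics (uminv/umaxv) exist only on AArch64, hence the guard.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Narrow each 32-bit lane to one byte (0x00 or 0xFF) and return the
   four bytes as a single 32-bit integer. Works on arm32 and arm64. */
static inline uint32_t pack_mask_bytes(uint32x4_t v) {
    uint16x4_t h = vmovn_u32(v);                   /* 4 x 16-bit lanes */
    uint8x8_t  b = vmovn_u16(vcombine_u16(h, h));  /* low 4 bytes = lanes 0..3 */
    return vget_lane_u32(vreinterpret_u32_u8(b), 0);
}

int all_packed(uint32x4_t v) { return pack_mask_bytes(v) == 0xFFFFFFFFu; }
int any_packed(uint32x4_t v) { return pack_mask_bytes(v) != 0; }

#if defined(__aarch64__)
/* AArch64 only: a single across-vector min/max does the reduction. */
int all_a64(uint32x4_t v) { return vminvq_u32(v) != 0; }
int any_a64(uint32x4_t v) { return vmaxvq_u32(v) != 0; }
#endif

#endif /* __ARM_NEON */
```

Whether the narrowing version beats two vpmin/vpmax steps on a given core is something I have not measured.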