Can FP compares like SSE2 _mm_cmpeq_pd / AVX _mm_cmp_pd be used to compare 64 bit integers?

The idea is to emulate missing _mm_cmpeq_epi64 that would be similar to _mm_cmpeq_epi8, _mm_cmpeq_epi16, _mm_cmpeq_epi32.

The concern is I'm not sure if the comparison is bitwise, or handles floating point specifically, like NAN values are always unequal.

Not directly, since NaN != NaN and -0.0 == 0.0. Probably something is possible with some clever bit-fiddling. – chtz
@chtz, thanks, that's enough for me, you can post this as answer if you like, I wouldn't go into clever bit fiddling, it is easier to do with _mm_cmpeq_epi32 then. – Alex Guteniev
I think this should work for any "normal" c: _mm_cmpeq_pd(c, _mm_xor_pd(c, _mm_xor_pd(a,b)));. Though interestingly, gcc optimizes away the c (but you can trick it by mixing the order of the xor operations): godbolt.org/z/jraEq9seb – chtz

1 Answer

AVX implies SSE4.1, so pcmpeqq is available; in that case you should just use _mm_cmpeq_epi64.

FP compares treat NaN != NaN, and -0.0 == +0.0, and if DAZ is set in MXCSR, treat any small integer (one with bits [62:52] all zero) as zero. (Because exponent = 0 means it represents a denormal, and Denormals-Are-Zero mode treats them as exactly zero on input to avoid possible speed penalties for any operations on any microarchitecture, including for compares. IIRC, modern microarchitectures don't have a penalty for subnormal inputs to compares, but do still for some other operations. In any case, programs built with -ffast-math set FTZ and DAZ for the main thread on startup.)
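To make the NaN and signed-zero cases concrete, here is a minimal sketch using only SSE2. The helper name fp_eq_mask is hypothetical (not from the question); it reinterprets two 64-bit integer patterns as doubles, compares them with cmppd, and returns the movmskpd bits. It assumes DAZ is not set.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: broadcast each 64-bit pattern to both lanes,
   compare with FP equality (cmppd), and return the 2-bit movmskpd result:
   3 means "equal" in both lanes, 0 means "not equal" in both lanes. */
static int fp_eq_mask(uint64_t a, uint64_t b)
{
    __m128d va = _mm_castsi128_pd(_mm_set1_epi64x((long long)a));
    __m128d vb = _mm_castsi128_pd(_mm_set1_epi64x((long long)b));
    return _mm_movemask_pd(_mm_cmpeq_pd(va, vb));
}
```

With this helper, an integer whose bits form a NaN (such as 0x7FF8000000000000) compares unequal to itself, while 0x8000000000000000 (the bit pattern of -0.0) compares equal to 0.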

So FP compares are not really usable for integers unless you know that some but not all of bits [62:52] (inclusive) will be set.


It's much easier to use pcmpeqd (_mm_cmpeq_epi32) than to hack up some FP bit-manipulation. (Although @chtz suggested in comments you could do 42.0 == (42.0 ^ (a^b)) with xorpd, as long as the compiler doesn't optimize away the constant and compare against 0.0. That's a GCC bug without -ffast-math).

If you want a condition like at-least-one-match then you need to make sure both halves of a 64-bit element matched, like mask & (mask<<1) on a movmskps result, which can compile to lea / test. (You could mask & (mask<<4) on a pmovmskb result, but that's slightly less efficient because LEA copy-and-shift can only shift by 0..3.)
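A minimal SSE2-only sketch of the at-least-one-match check described above. The function name any_eq_epi64_sse2 is made up for illustration. After mask & (mask<<1), only bits 1 and 3 say "both halves of a 64-bit element matched"; bit 2 mixes halves of two different elements, so it has to be masked off.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns 1 if at least one 64-bit element of a
   equals the corresponding element of b, else 0. */
static int any_eq_epi64_sse2(__m128i a, __m128i b)
{
    /* One mask bit per 32-bit half: bits 0,1 = element 0; bits 2,3 = element 1. */
    int mask = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(a, b)));
    /* Bit 1 = low & high of element 0, bit 3 = low & high of element 1.
       Bit 2 would combine halves of different elements, so keep only 0xA. */
    return (mask & (mask << 1) & 0xA) != 0;
}
```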

Of course "all-matched" doesn't care about element sizes so you can just use _mm_movemask_epi8 on any compare result, and check it against 0xFFFF.
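The all-matched case can be sketched like this, again with SSE2 only; all_eq_epi64_sse2 is a hypothetical name. Because every byte must match, the element width of the compare is irrelevant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns 1 if both 64-bit elements of a equal the
   corresponding elements of b (i.e. all 16 bytes match), else 0. */
static int all_eq_epi64_sse2(__m128i a, __m128i b)
{
    /* pmovmskb yields one bit per byte; 0xFFFF means every byte compared equal. */
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF;
}
```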

If you want to use it for a blend with and/andnot/or, you can pshufd / pand to swap halves within 64-bit elements. (If you were feeding pblendvb or blendvpd, that would mean SSE4.1 was available so you should have used pcmpeqq.)
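As a sketch of the pshufd / pand approach, the following SSE2-only helpers build a full 64-bit equality mask and use it in an and/andnot/or blend. The names cmpeq_epi64_sse2 and select_epi64_sse2 are hypothetical, chosen for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Hypothetical SSE2 stand-in for _mm_cmpeq_epi64 (pcmpeqq). */
static __m128i cmpeq_epi64_sse2(__m128i a, __m128i b)
{
    __m128i eq32 = _mm_cmpeq_epi32(a, b);
    /* Swap the 32-bit halves within each 64-bit element. */
    __m128i swapped = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(2, 3, 0, 1));
    /* An element is all-ones only if both of its halves compared equal. */
    return _mm_and_si128(eq32, swapped);
}

/* Hypothetical blend: takes x where mask is all-ones, y where it is zero. */
static __m128i select_epi64_sse2(__m128i mask, __m128i x, __m128i y)
{
    return _mm_or_si128(_mm_and_si128(mask, x), _mm_andnot_si128(mask, y));
}
```

This costs two extra instructions over pcmpeqq, which is why it is only worth using when SSE4.1 cannot be assumed.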

The more expensive one to emulate is SSE4.2 pcmpgtq, although I think GCC and/or clang do know how to emulate it when auto-vectorizing.
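For illustration, one commonly used SSE2 sequence for a signed 64-bit greater-than is sketched below. This is not necessarily what GCC or clang emit, and the name cmpgt_epi64_sse2 is hypothetical. The high halves decide the result with a signed 32-bit compare unless they are equal, in which case the borrow out of the low halves in b - a decides.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Hypothetical SSE2 stand-in for _mm_cmpgt_epi64 (pcmpgtq), signed compare. */
static __m128i cmpgt_epi64_sse2(__m128i a, __m128i b)
{
    /* Where the high halves are equal, the high half of (b - a) is all-ones
       exactly when the low half of a is greater (unsigned) than that of b. */
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
    /* Where the high halves differ, the signed 32-bit compare decides. */
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
    /* Only the high half of each element is meaningful; broadcast it. */
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));
}
```

That is four instructions where pcmpeqq emulation needs three, and the dependency chain is longer.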