I'm trying to find the most way of performing 8 bit unsigned compares using SSE (up to SSE 4.2).
The most common case I'm working on is comparing for > 0U, e.g.
_mm_cmpgt_epu8(v, _mm_setzero_si128()) // #1
(which of course can also be considered to be a simple test for non-zero.)
But I'm also somewhat interested in the more general case, e.g.
_mm_cmpgt_epu8(v1, v2) // #2
The first case can be implemented with 2 instructions, using various different methods, e.g. compare with 0 and then invert the result. The second case typically requires 3 instructions, e.g. subtract 128 from both operands and perform signed compare. (See this question for various 3 instruction solutions.)
What I'm looking for ideally is a single instruction solution for #1, and a two instruction solution for #2. If neither of these are possible then I'm also interested in thoughts as to which of the various possible 2 or 3 instruction implementations is most efficient on modern Intel CPUs (Sandy Bridge, Ivy Bridge, Haswell).
Best implementations for case #2 so far:
- compare equal with unsigned max and invert result:
#define _mm_cmpgt_epu8(v0, v1) \ _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(v0, v1), v1), \ _mm_set1_epi8(-1))
Two arithmetic instructions + one bitwise = 1.33 throughput.
- invert sign bits for both arguments (== subtract 128) and use signed compare:
#define _mm_cmpgt_epu8(v0, v1) \ _mm_cmpgt_epi8(_mm_xor_si128(v0, _mm_set1_epi8(-128)), \ _mm_xor_si128(v1, _mm_set1_epi8(-128)))
One arithmetic instruction + two bitwise = 1.16 throughput.
Best implementations for case #1, derived from case #2 implementations above:
- 1.
#define _mm_cmpgtz_epu8(v0) \ _mm_andnot_si128(_mm_cmpeq_epi8(v0, _mm_set1_epi8(0)), \ _mm_set1_epi8(-1))
One arithmetic instruction + one bitwise = 0.83 throughput.
- 2.
#define _mm_cmpgtz_epu8(v0) \ _mm_cmpgt_epi8(_mm_xor_si128(v0, _mm_set1_epi8(-128)), \ _mm_set1_epi8(-128)))
One arithmetic instruction + one bitwise = 0.83 throughput.
_mm_subs_epu8
is enough to reduce problem #2 to problem #1. C) The question can be solved once and for all by using superoptimization. I think it is possible to perform bruteforce search over all sequences of 2 SSE instructions with only 2 registers and a limited set of constants (e.g. the ones produced by_mm_set1_epi8
). – stgatilov