Update: @PaulR came up with an even better idea. Accept that answer instead. _mm_min_epu8 (1 uop) is at least as cheap as _mm_blendv_epi8 (2 uops on most), and only requires SSE2.
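For reference, a minimal sketch of what the unsigned-min idea looks like, assuming the goal is to replace every negative byte with 0x80 and leave the rest alone. The names negative_to_min_umin and demo_umin are made up for illustration; see the other answer for the real thing.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2

// As unsigned bytes, negative values are 0x80..0xFF, so an unsigned min
// against 0x80 clamps them to 0x80 and leaves 0x00..0x7F unchanged.
static inline __m128i negative_to_min_umin(__m128i v)
{
    return _mm_min_epu8(v, _mm_set1_epi8((char)0x80));
}

// Usage check on a few sample bytes (the rest of the array is zero).
static void demo_umin(void)
{
    const signed char in[16] = {0, 1, 127, -1, -128, -127, 5, -5};
    signed char out[16];
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, negative_to_min_umin(v));
    for (int i = 0; i < 16; i++)
        assert(out[i] == (in[i] < 0 ? -128 : in[i]));
}
```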
The rest of this answer is less good than _mm_min_epu8; I'm leaving it here in case it helps for related cases where the min trick doesn't exactly work.
SSE4.1 (and thus AVX and later) has a variable-blend that selects based on the top bit of each byte. You can use your vector as the blend control and one of the data inputs.
// SSE4.1 or AVX1. Or for __m256i, AVX2
#include <smmintrin.h>

__m128i negative_to_min(__m128i v){
    // take 2nd operand for elements of v where the high bit is set
    return _mm_blendv_epi8(v, _mm_set1_epi8(0x80), v);
}
With only SSE2, you want 0 > v with pcmpgtb to identify negative elements. The straightforward way would be the usual AND/ANDN/OR to blend without pblendvb, but we can be more clever based on the fact that the top bit of the result always matches the top bit of the input, and that the result we want for the negative case is in fact x & 0x80.
//                      negative          non-neg
m = 0x7f ^ (0>x);    // 0x80              0x7f
x &= m;              // x&0x80 = 0x80     x & 0x7f = x

// SSE2
#include <emmintrin.h>

__m128i negative_to_min(__m128i v)
{
    __m128i neg  = _mm_cmpgt_epi8(_mm_setzero_si128(), v);  // 0xff for neg, 0x00 for non-neg
    __m128i mask = _mm_xor_si128(neg, _mm_set1_epi8(0x7f)); // 0x80 for neg, 0x7f for non-neg
    return _mm_and_si128(mask, v);
}
This is fewer instructions (3), with critical-path latency no worse than PCMPGTB / AND / ANDN / OR. It also shouldn't need any extra movdqa instructions, if the compiler generates a zero vector cheaply with pxor xmm0,xmm0 and then overwrites it as the pcmpgtb destination.
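If you want to convince yourself the mask arithmetic works, here is a per-byte scalar model of the compare / XOR / AND sequence, checked against the plain definition for all 256 byte values. It assumes the XOR constant is 0x7f, which is what yields the 0x80 / 0x7f masks in the table above. The names negative_to_min_model and check_model are hypothetical, and this is only a model of the vector code, not something you'd use for speed.

```c
#include <assert.h>
#include <stdint.h>

// One byte lane of: neg = pcmpgtb(0, x); mask = neg ^ 0x7f; result = x & mask
static uint8_t negative_to_min_model(uint8_t x)
{
    uint8_t neg  = (x & 0x80) ? 0xFF : 0x00;   // high bit set == negative as signed
    uint8_t mask = (uint8_t)(neg ^ 0x7F);      // 0x80 if negative, 0x7f otherwise
    return (uint8_t)(x & mask);
}

// Exhaustive check: negative bytes become 0x80, everything else is unchanged.
static void check_model(void)
{
    for (int i = 0; i < 256; i++) {
        uint8_t x = (uint8_t)i;
        uint8_t expected = (x & 0x80) ? 0x80 : x;
        assert(negative_to_min_model(x) == expected);
    }
}
```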
If you had a use for a 0x80 constant somewhere else (instead of 0x7f), you could XOR with 0x80 to get the inverted mask (0x7f for negative elements, 0x80 for non-negative) and use _mm_andnot_si128(mask, v) as the last step to invert it back. Otherwise, it's best to use a commutative operation to give the compiler an easier time optimizing.
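A sketch of that inverted-mask variant, in case it helps. It assumes a 0x80 constant is what you already have in a register; negative_to_min_andn and demo_andn are names I made up for this example.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2

static inline __m128i negative_to_min_andn(__m128i v)
{
    __m128i neg  = _mm_cmpgt_epi8(_mm_setzero_si128(), v);         // 0xff neg, 0x00 non-neg
    __m128i mask = _mm_xor_si128(neg, _mm_set1_epi8((char)0x80));  // 0x7f neg, 0x80 non-neg
    return _mm_andnot_si128(mask, v);                              // (~mask) & v
}

// Usage check on a few sample bytes (the rest of the array is zero).
static void demo_andn(void)
{
    const signed char in[16] = {0, 1, 127, -1, -128, -127, 5, -5};
    signed char out[16];
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, negative_to_min_andn(v));
    for (int i = 0; i < 16; i++)
        assert(out[i] == (in[i] < 0 ? -128 : in[i]));
}
```

Note that _mm_andnot_si128 inverts its first operand, so the operand order matters here in a way it doesn't for the plain AND version.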
re: Your approach: Without AVX512, movemask isn't a useful building block for this. There's no SIMD way to use a bitmap with a vector. Compare instructions / intrinsics before AVX512 produce vector masks instead of bitmasks so you can use them with AND/ANDN/XOR/OR bitwise operations.
Also, your -1 > v would mis-identify -1 as non-negative.
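To illustrate both points, here is a sketch of the usual AND/ANDN/OR blend driven by a compare-result vector mask, with the compare written as 0 > v so that -1 counts as negative. It's one more instruction than the XOR/AND trick and is only here for comparison; negative_to_min_blend and demo_blend are hypothetical names.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2

static inline __m128i negative_to_min_blend(__m128i v)
{
    // Vector mask: all-ones in each negative byte, all-zeros elsewhere.
    __m128i neg = _mm_cmpgt_epi8(_mm_setzero_si128(), v);
    __m128i min = _mm_set1_epi8((char)0x80);
    // (neg & min) | (~neg & v): take min where negative, v otherwise.
    return _mm_or_si128(_mm_and_si128(neg, min), _mm_andnot_si128(neg, v));
}

// Usage check, including -1, which a (-1 > v) compare would miss.
static void demo_blend(void)
{
    const signed char in[16] = {-1, 0, 1, 127, -128, -2, 42, -42};
    signed char out[16];
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, negative_to_min_blend(v));
    for (int i = 0; i < 16; i++)
        assert(out[i] == (in[i] < 0 ? -128 : in[i]));
}
```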