2 votes

Basically what I want to do is take an __m128i register and, for each negative byte, set it to -128 (0x80), leaving all non-negative bytes unchanged.

The exact scalar equivalent is:

signed char __m128_as_char_arr[16] = { /* some data */ };
for (int i = 0; i < 16; i++) {
    if (__m128_as_char_arr[i] < 0) {   // alternatively: __m128_as_char_arr[i] & 0x80
        __m128_as_char_arr[i] = 0x80;  // i.e. -128
    }
}

I am thinking the best way to do this is something along the lines of:

__m128i v = /* some data */;
int mask = _mm_movemask_epi8(_mm_cmpgt_epi8(_mm_set1_epi8(0xff), v));

// use the mask somehow to set only the bytes whose mask bit is 1

But I don't know (1) what instruction to use to set only the bytes associated with the mask, and (2) whether there is a better way to do this (either without the mask at all, or with a better way to generate the mask).


2 Answers

7 votes

You can treat the values as if they were unsigned and use a min operation (_mm_min_epu8 et al), e.g.

v = _mm_min_epu8(v, _mm_set1_epi8(128));

As well as being a cheap instruction, this works for SSE2 and up.
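This works because, viewed as unsigned, negative bytes are 0x80..0xFF and non-negative bytes are 0x00..0x7F, so an unsigned min against 0x80 clamps exactly the negative ones. If it helps, here is a minimal self-contained sketch of the trick end to end (the sample values and the (char) cast are mine, not part of the answer):

#include <emmintrin.h>  // SSE2
#include <stdio.h>

int main(void) {
    signed char in[16] = { -1, -128, -37, 0, 1, 127, -2, 64,
                           -100, 5, -5, 99, -99, 0, 80, -80 };
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    // Unsigned min: negative bytes (0x80..0xFF) clamp to 0x80,
    // non-negative bytes (0x00..0x7F) are already below it.
    v = _mm_min_epu8(v, _mm_set1_epi8((char)0x80));

    signed char out[16];
    _mm_storeu_si128((__m128i *)out, v);
    for (int i = 0; i < 16; i++)
        printf("%4d -> %4d\n", in[i], out[i]);
    return 0;
}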

3 votes

Update: @PaulR came up with an even better idea; accept that answer instead. _mm_min_epu8 (1 uop) is at least as cheap as _mm_blendv_epi8 (2 uops on most CPUs), and it only requires SSE2.


This is less good than _mm_min_epu8; I'm leaving it here in case it helps with related cases where the min trick doesn't quite work.

SSE4.1 (and thus AVX and later) has a variable-blend that selects based on the top bit of each byte. You can use your vector as the blend control and one of the data inputs.

// SSE4.1 or AVX1.  Or for __m256i, AVX2
__m128i  negative_to_min(__m128i v){
    // take 2nd operand for elements of v where the high bit is set
    return _mm_blendv_epi8(v, _mm_set1_epi8(0x80), v);
}
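Since the comment above mentions __m256i / AVX2, the 256-bit version is the direct analogue. A sketch (the function name and the (char) cast are mine):

#include <immintrin.h>  // AVX2

__m256i negative_to_min_256(__m256i v) {
    // vpblendvb: take 0x80 in lanes where the high (sign) bit of v is set
    return _mm256_blendv_epi8(v, _mm256_set1_epi8((char)0x80), v);
}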

With only SSE2, you want 0 > v with pcmpgtb to identify negative elements. The straightforward way would be the usual AND/ANDN/OR to blend without pblendvb (sketched below for comparison), but we can be more clever, based on two facts: the top bit of the result always matches the top bit of the input, and the result we want for the negative case is exactly x & 0x80.

                   // negative           non-negative
m = 0x7f ^ (0>x);  // 0x7f^0xff = 0x80   0x7f^0x00 = 0x7f
x &= m;            // x & 0x80  = 0x80   x & 0x7f  = x
// SSE2
__m128i  negative_to_min(__m128i v)
{
    __m128i  neg = _mm_cmpgt_epi8(_mm_setzero_si128(), v);    // neg        non-neg
    __m128i  mask = _mm_xor_si128(neg, _mm_set1_epi8(0x7f));  // 0x80   or  0x7f
    return   _mm_and_si128(mask, v);
}

This is fewer instructions (3), with critical-path latency no worse than PCMPGTB / AND / ANDN / OR. It also shouldn't need any extra movdqa instructions, as long as the compiler generates a zero vector cheaply with pxor xmm0,xmm0 and then overwrites it as the pcmpgtb destination.
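For comparison, the conventional blend mentioned above would look something like this (a sketch; the function name is mine):

// SSE2: the straightforward PCMPGTB / AND / ANDN / OR blend, one instruction longer
__m128i  negative_to_min_blend(__m128i v)
{
    __m128i  neg = _mm_cmpgt_epi8(_mm_setzero_si128(), v);  // 0xff where v < 0
    return   _mm_or_si128(_mm_and_si128(neg, _mm_set1_epi8((char)0x80)),  // 0x80 in negative lanes
                          _mm_andnot_si128(neg, v));                      // v in non-negative lanes
}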

If you had a use for a 0x80 instead of a 0x7f constant somewhere else, you could xor with 0x80 and use _mm_andnot_si128(mask, v) as the last step, to invert the mask. Otherwise, it's best to use a commutative operation, to give the compiler an easier time optimizing.
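Spelled out, that variant would be (again a sketch, with a hypothetical name):

// SSE2: same result with the mask inverted, so the constant is 0x80
__m128i  negative_to_min_andn(__m128i v)
{
    __m128i  neg  = _mm_cmpgt_epi8(_mm_setzero_si128(), v);        // 0xff  or  0x00
    __m128i  mask = _mm_xor_si128(neg, _mm_set1_epi8((char)0x80)); // 0x7f  or  0x80
    return   _mm_andnot_si128(mask, v);                            // (~mask) & v: 0x80  or  v
}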


Re: your approach: without AVX512, movemask isn't a useful building block for this, because there's no SIMD way to use a bitmap to control a vector operation. Compare instructions / intrinsics before AVX512 produce vector masks instead of bitmasks, so you can use them with AND/ANDN/XOR/OR bitwise operations.

Also, your -1 > v compare would mis-identify -1 as non-negative: pcmpgtb is a strict greater-than, so -1 > -1 is false. Compare against zero (0 > v) instead, as in the code above.