3
votes

I want to find the element-wise max of two vectors, each containing 8 x 16-bit unsigned int elements.

__m128i vi_A= _mm_loadu_si128(reinterpret_cast<const __m128i*>(&pSrc[0])); // 8 16-Bit Elements
__m128i vi_B= _mm_loadu_si128(reinterpret_cast<const __m128i*>(&pSrc1[0])); // 8 16-Bit Elements
__m128i vi_Max = _mm_max_epi16(vi_A,vi_B);  //<-- Error 

But _mm_max_epi16 performs a signed comparison, so any element of 0x8000 or above is treated as negative and the result is wrong. So I tried the unsigned version, which is an SSE4.1 intrinsic:

vi_Max = _mm_max_epu16(vi_A,vi_B);

but I'm not allowed to use SSE4.1 intrinsics. So what is an efficient way to find the max of these elements?

1 Answer

5
votes

One (somewhat inefficient) way of doing it is to offset the input values by 0x8000 and then add this offset back to the result, e.g.:

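#include <emmintrin.h> // SSE2

#ifndef __SSE4_1__
// SSE2 fallback: subtract 0x8000 from both inputs so that unsigned order
// matches signed order, take the signed max, then add the offset back.
inline __m128i _mm_max_epu16(const __m128i v0, const __m128i v1)
{
    const __m128i offset = _mm_set1_epi16(static_cast<short>(0x8000));
    return _mm_add_epi16(
               _mm_max_epi16(
                   _mm_sub_epi16(v0, offset),
                   _mm_sub_epi16(v1, offset)),
               offset);
}
#endif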

With gcc or clang this generates one load instruction for the constant and four arithmetic instructions.
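
To see why the offset works, here is a scalar sketch of a single 16-bit lane. The names max_u16_via_signed and check_scalar_model are made up for illustration, and the conversion to int16_t assumes the usual two's-complement wrap-around (only guaranteed by the standard from C++20 on).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scalar model of one lane (hypothetical helper, not a library function).
// Subtracting 0x8000 modulo 2^16 maps unsigned 0..65535 onto signed
// -32768..32767 while preserving order, so a signed max picks the same element.
inline std::uint16_t max_u16_via_signed(std::uint16_t a, std::uint16_t b)
{
    const auto bias = [](std::uint16_t x) {
        // Wrap-around subtraction, then reinterpret the 16 bits as signed
        // (assumes two's complement, as on every x86 compiler).
        return static_cast<std::int16_t>(static_cast<std::uint16_t>(x - 0x8000u));
    };
    const std::int16_t m = std::max(bias(a), bias(b));
    // Add the offset back, again modulo 2^16.
    return static_cast<std::uint16_t>(static_cast<std::uint16_t>(m) + 0x8000u);
}

// Small usage example covering values on both sides of 0x8000.
inline void check_scalar_model()
{
    assert(max_u16_via_signed(0xFFFF, 1) == 0xFFFF);      // a plain signed max would return 1
    assert(max_u16_via_signed(0x7FFF, 0x8000) == 0x8000); // boundary between "positive" and "negative"
    assert(max_u16_via_signed(0, 0) == 0);
}
```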


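Alternatively, you can use _mm_xor_si128 in place of _mm_add_epi16/_mm_sub_epi16, since adding or subtracting 0x8000 modulo 2^16 simply flips the top bit of each element:
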
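#ifndef __SSE4_1__
// SSE2 fallback: flip the sign bit of both inputs so that unsigned order
// matches signed order, take the signed max, then flip the sign bit back.
inline __m128i _mm_max_epu16(const __m128i v0, const __m128i v1)
{
    const __m128i sign = _mm_set1_epi16(static_cast<short>(0x8000));
    return _mm_xor_si128(
               _mm_max_epi16(
                   _mm_xor_si128(v0, sign),
                   _mm_xor_si128(v1, sign)),
               sign);
}
#endif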
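
As a sanity check, the XOR variant can be compared lane by lane against std::max. This is only a sketch: max_epu16_sse2, matches_scalar_max and check_sse2_max are hypothetical names (the function is renamed here so it cannot clash with the real SSE4.1 intrinsic), and it assumes an x86 target with SSE2 available.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <algorithm>
#include <cassert>
#include <cstdint>

// Unsigned 16-bit max using only SSE2 (hypothetical name).
inline __m128i max_epu16_sse2(__m128i v0, __m128i v1)
{
    // 0x8000 in every lane; the cast assumes two's complement.
    const __m128i sign = _mm_set1_epi16(static_cast<short>(0x8000));
    return _mm_xor_si128(
               _mm_max_epi16(_mm_xor_si128(v0, sign), _mm_xor_si128(v1, sign)),
               sign);
}

// Returns true if all 8 lanes agree with the scalar std::max.
inline bool matches_scalar_max(const std::uint16_t (&a)[8], const std::uint16_t (&b)[8])
{
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    std::uint16_t out[8];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), max_epu16_sse2(va, vb));
    for (int i = 0; i < 8; ++i)
        if (out[i] != std::max(a[i], b[i]))
            return false;
    return true;
}

// Usage example with inputs on both sides of 0x8000.
inline void check_sse2_max()
{
    const std::uint16_t a[8] = {0, 1, 0x7FFF, 0x8000, 0xFFFF, 40000, 123, 65535};
    const std::uint16_t b[8] = {65535, 0, 0x8000, 0x7FFF, 1, 30000, 123, 0};
    assert(matches_scalar_max(a, b));
}
```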