2
votes

This question seems similar to Getting max value in a __m128i vector with SSE? but with shorts and minimum instead of integer + maximum. This is what I came up with:

typedef short int weight;

weight horizontal_min_Vec4i(__m128i x) {
    __m128i max1 = _mm_shufflehi_epi16(x, _MM_SHUFFLE(0, 0, 3, 2));
    __m128i max1b = _mm_shufflelo_epi16(x, _MM_SHUFFLE(0, 0, 3, 2));
    __m128i max2 = _mm_min_epi16(max1, max1b);
    //max2 = _mm_min_epi16(max2, x);
    max1 = _mm_shufflehi_epi16(max2, _MM_SHUFFLE(0, 0, 0, 1));
    max1b = _mm_shufflelo_epi16(max2, _MM_SHUFFLE(0, 0, 0, 1));
    __m128i max3 = _mm_min_epi16(max1, max1b);
    max2 = _mm_min_epi16(max2, max3);
    return min(_mm_extract_epi16(max2, 0), _mm_extract_epi16(max2, 4));
}

The function basically does the same as the answer in https://stackoverflow.com/a/18616825/1500111 for the upper and lower parts of x. So, I know the minimum value is either in the position 0 or 4 of the __m128i variable max2. Although it is much faster than the no SIMD function horizontal_min_Vec4i_Plain(__m128i x) shown below, I am afraid the bottleneck is the _mm_extract_epi16 operation at the last line. Is there a better way to achieve this, for a better speed up? I am using Haswell so I have access to the latest SSE extensions.

weight horizontal_min_Vec4i_Plain(__m128i x) {
    weight result[8] __attribute__((aligned(16)));
    _mm_store_si128((__m128i *) result, x);
    weight myMin = result[0];
    for (int l = 1; l < 8; l++) {
        if (myMin > result[l]) {
            myMin = result[l];
        }
    }
    return myMin;
}
1
Why is taking the horizontal min of 16 shorts critical?Z boson
@Zboson Taking the minimum for __m128i x would be done for more than 100k times. For each of this 100K times, there was a total of 24-64 SIMD additions + MAX to create the short values inside __m128i x.Alexandros
Are you just looking for _mm_minpos_epu16?harold
@harold I thought _mm_minpos_epu16 was MS specific. I will try that out.Alexandros
XOR with _mm_set1_epi16(0x8000) if you want to find the signed minimumharold

1 Answers

2
votes

Signed and unsigned comparison are almost the same, except that the range with the top bit set is treated as bigger than the range with the top bit not set in unsigned comparisons, and as smaller in signed comparisons. That means signed and unsigned comparisons can be converted into each other by these rules:

x <s y = (x ^ signbit) <u (y ^ signbit)
x <u y = (x ^ signbit) <s (y ^ signbit)

This property transfers directly to min and max, so:

min_s(x, y) = min_u(x ^ signbit, y ^ signbit) ^ signbit

And then we can use _mm_minpos_epu16 to handle the horizontal minimum, to get, in total, something like

__m128i xs = _mm_xor_si128(x, _mm_set1_epi16(0x8000));
return _mm_extract_epi16(_mm_minpos_epu16(xs), 0) - 0x8000;

The - 0x8000 is ^ 0x8000 and sign-extension (extract zero-extends) rolled into one.