4 votes

I have a C++ (or C-like) function below which I am attempting to vectorize. The function is one of many variations of image compositing: it takes a Y, U, or V image plane with 4:4:4 chroma subsampling and composites/overlays a src image onto a dst image (where the src image also has an accompanying alpha/transparency plane).

#include <cstdint>


void composite(uint8_t *__restrict__ pSrc,  // Source plane
               uint8_t *__restrict__ pSrcA, // Source alpha plane 
               uint8_t *__restrict__ pDst,  // Destination plane
               const std::size_t nCount)    // Number of component pixels to process.
{
    for (std::size_t k = 0; k < nCount; ++k)
    {
        uint16_t w = (pSrc[k] * pSrcA[k]);
        uint16_t x = (255 - pSrcA[k]) * pDst[k];
        uint16_t y = w+x;
        uint16_t z = y / uint16_t{255};
        pDst[k] = static_cast<uint8_t>(z);
    }
}

In the AVX2 vectorized equivalent, I'm struggling to understand how to efficiently read the 8-bit samples, widen them to 16 bits, and (after processing/compositing) narrow the 16-bit samples back to 8 bits to store back to memory. On the read side I'm going through an intermediate xmm register, which doesn't seem like the best approach; I'm guessing there will be a performance penalty when mixing families of registers.

I've come up with (incomplete):

#include <cstdint>

#include <immintrin.h>
#include <emmintrin.h>


///////////////////////////////////////////////////////////////////////////
// Credit: https://stackguides.com/questions/35285324/how-to-divide-16-bit-integer-by-255-with-using-sse
#define AVX2_DIV255_U16(x) _mm256_srli_epi16(_mm256_mulhi_epu16(x, _mm256_set1_epi16((short)0x8081)), 7)

///////////////////////////////////////////////////////////////////////////
/// Blends/composites/overlays two planes of Y, U, or V plane with 4:4:4 chroma subsampling over the other.
/// \param d The destination Y, U , or V component
/// \param s The source Y, U, or V component
/// \param sa The source alpha component
/// \param pixels The number of pixels that require processing.
/// \return The number of pixels processed.
int blend_plane_pixels_444_vectorized(uint8_t *__restrict__ d,
                                      uint8_t *__restrict__ s,
                                      uint8_t *__restrict__ sa,
                                      const int pixels)
{
    int n = 0; // Return number of component pixels processed.
    for (int k = 0; k + 32 <= pixels; k += 32)
    {
        // Load first 16 (unaligned) of d, s, sa
        // TODO: Is this efficient? Mixing xmm registers with ymm??
        auto vecD0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)d));
        auto vecS0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)s));
        auto vecSa0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)sa));

        // Load second 16 (unaligned) of d, s, sa
        auto vd1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)(d + 16)));
        auto vs1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)(s + 16)));
        auto vsa1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)(sa + 16)));

        // Load 255 into register
        auto vec255 = _mm256_set1_epi16(255);

        // uint16_t w = (pSrc[k] * pSrcA[k]);
        auto vecW0 = _mm256_mullo_epi16(vecS0, vecSa0);
        auto vecW1 = _mm256_mullo_epi16(vs1, vsa1);

        // uint16_t x = (255 - pSrcA[k]) * pDst[k];
        auto vecX0 = _mm256_mullo_epi16(_mm256_subs_epu16(vec255, vecSa0), vecD0);
        auto vecX1 = _mm256_mullo_epi16(_mm256_subs_epu16(vec255, vsa1), vd1);

        // Load 127 into register
        auto vec127 = _mm256_set1_epi16(127);

        // uint16_t y = w+x;
        auto vecY0 = _mm256_adds_epu16(_mm256_adds_epu16(vecW0, vecX0), vec127);
        auto vecY1 = _mm256_adds_epu16(_mm256_adds_epu16(vecW1, vecX1), vec127);

        // uint16_t z = y / uint16_t{255};
        auto vecZ0 = AVX2_DIV255_U16(vecY0);
        auto vecZ1 = AVX2_DIV255_U16(vecY1);

        // TODO: How to get this back into 8-bit samples so that it can be stored
        //       back into array.
        auto vecResult = _mm256_blendv_epi8(vecZ0, vecZ1, _mm256_set1_epi16(127));

        // Write data back to memory (unaligned)
        _mm256_storeu_si256((__m256i*)d, vecResult);

        d += 32;
        s += 32;
        sa += 32;
        n += 32;
    }

    return n;
}

SIMD is not my forte, and it's something I need to get better at - please be gentle. I imagine there are probably many tweaks I could apply to the current vectorized code (suggestions welcome!)

Development Environment:

  • Linux Ubuntu 18.04
  • G++ v8.3.0
  • C++14
Typically you need a vpackuswb + shuffle to account for lane-crossing. Or unpack lo/hi a pair of 256-bit vectors with _mm256_setzero_si256() in the first place so repacking is just in-lane vpackuswb. – Peter Cordes
Or to save on instructions, you might want to interleave(?) to set up for vpmaddubsw, if you can make that work. (It treats one input as signed, the other as unsigned, so it won't work easily for pixels * alpha, except maybe with a range-shift to signed and then adjust? But saturation is a problem so no, I don't think so.) – Peter Cordes
Thanks for the pointers! The former option seems more appealing but I'll take a look. – ZeroDefect

1 Answer

5 votes

Generally, if you need to re-pack the result to 8-bit integers, you are better off either unpacking with zero using punpcklbw/punpckhbw and re-packing the result using packuswb, or (sometimes) masking out the odd and even bytes into separate registers, doing the calculation, and bitwise-ORing the results together.
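
For illustration, here is an untested sketch of that first variant (widen by interleaving with zero, narrow with an in-lane packuswb), processing one block of 32 component pixels with the same pointer names as your attempt; blend_block32_unpack is just an illustrative name:

#include <cstdint>
#include <immintrin.h>

// Untested sketch: zero-unpack / packuswb variant for one block of 32 component pixels.
static inline void blend_block32_unpack(uint8_t *d, const uint8_t *s, const uint8_t *sa)
{
    const __m256i zero = _mm256_setzero_si256();
    const __m256i v255 = _mm256_set1_epi16(255);
    const __m256i v127 = _mm256_set1_epi16(127);
    const __m256i mul  = _mm256_set1_epi16((short)0x8081); // for the /255 trick

    __m256i vecD = _mm256_loadu_si256((const __m256i_u *)d);
    __m256i vecS = _mm256_loadu_si256((const __m256i_u *)s);
    __m256i vecA = _mm256_loadu_si256((const __m256i_u *)sa);

    // Widen to 16 bit by interleaving with zero. unpacklo/hi are in-lane,
    // so the in-lane packus at the end restores the original byte order.
    __m256i dLo = _mm256_unpacklo_epi8(vecD, zero), dHi = _mm256_unpackhi_epi8(vecD, zero);
    __m256i sLo = _mm256_unpacklo_epi8(vecS, zero), sHi = _mm256_unpackhi_epi8(vecS, zero);
    __m256i aLo = _mm256_unpacklo_epi8(vecA, zero), aHi = _mm256_unpackhi_epi8(vecA, zero);

    // r = s*a + (255-a)*d + 127  (same formula as the scalar loop, plus rounding bias)
    __m256i rLo = _mm256_add_epi16(_mm256_add_epi16(
                      _mm256_mullo_epi16(sLo, aLo),
                      _mm256_mullo_epi16(_mm256_sub_epi16(v255, aLo), dLo)), v127);
    __m256i rHi = _mm256_add_epi16(_mm256_add_epi16(
                      _mm256_mullo_epi16(sHi, aHi),
                      _mm256_mullo_epi16(_mm256_sub_epi16(v255, aHi), dHi)), v127);

    // r /= 255 via multiply-high and shift (the trick from your macro)
    rLo = _mm256_srli_epi16(_mm256_mulhi_epu16(rLo, mul), 7);
    rHi = _mm256_srli_epi16(_mm256_mulhi_epu16(rHi, mul), 7);

    // Narrow back to 8 bit; no lane fix-up needed because the widening was in-lane.
    _mm256_storeu_si256((__m256i_u *)d, _mm256_packus_epi16(rLo, rHi));
}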

The "problem" with _mm256_cvtepu8_epi16/vpmovzxbw is that it is lane-crossing (i.e., it takes input from only the lower 128 bit half (or memory), but the result is in the upper and the lower half), and there is no (easy) solution to join 16 bit values from different lanes back to one (until AVX512 lane-crossing one-register pack instructions with saturation or truncation).

In your case you can actually pack together the d and s values in one register and the a and 255-a values in another, and use vpmaddubsw for the multiplication and addition. You need to subtract 128 from the d and s values before packing them together, since one argument needs to be a signed int8. The result will be off by 128*255, but that can be compensated, especially if you add 127 for rounding anyway. (If you don't, you can add 128 to each byte after dividing (signed division with rounding down) and repacking.)

Untested code, using the same signature as your attempt:

#include <cstdint>
#include <immintrin.h>

// https://stackguides.com/questions/35285324/how-to-divide-16-bit-integer-by-255-with-using-sse
inline __m256i div255_epu16(__m256i x) {
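    // 0x8081 = 32897 = ceil(2^23 / 255), so ((x * 0x8081) >> 23) == x / 255 holds
    // exactly for every 16-bit x: mulhi_epu16 supplies the ">> 16" part and the
    // srli_epi16 below the remaining ">> 7".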
    __m256i mulhi = _mm256_mulhi_epu16(x, _mm256_set1_epi16((short)0x8081));
    return _mm256_srli_epi16(mulhi, 7);
}

int blend_plane_pixels_444_vectorized(uint8_t *__restrict__ d,
                                      uint8_t *__restrict__ s,
                                      uint8_t *__restrict__ sa,
                                      const int pixels)
{
    int n = 0; // Return number of component pixels processed.
    for (int k = 0; k + 32 <= pixels; k += 32)
    {
        // Load 32 (unaligned) of d, s, sa
        __m256i vecD = _mm256_loadu_si256((__m256i_u *)d);
        __m256i vecS = _mm256_loadu_si256((__m256i_u *)s );
        __m256i vecA = _mm256_loadu_si256((__m256i_u *)sa);

        // subtract 128 from D and S to have them in the signed domain
        // (subtracting 128 is equivalent to XOR with 128)
        vecD = _mm256_xor_si256(vecD, _mm256_set1_epi8(0x80));
        vecS = _mm256_xor_si256(vecS, _mm256_set1_epi8(0x80));

        // calculate 255-a (equivalent to 255 ^ a):
        __m256i vecA_ = _mm256_xor_si256(vecA, _mm256_set1_epi8(0xFF));

        __m256i vecAA_lo = _mm256_unpacklo_epi8(vecA, vecA_);
        __m256i vecSD_lo = _mm256_unpacklo_epi8(vecS, vecD);
        __m256i vecAA_hi = _mm256_unpackhi_epi8(vecA, vecA_);
        __m256i vecSD_hi = _mm256_unpackhi_epi8(vecS, vecD);

        // R = a * (s-128) + (255-a)*(d-128) = a*s + (255-a)*d - 128*255
        __m256i vecR_lo = _mm256_maddubs_epi16(vecAA_lo,vecSD_lo);
        __m256i vecR_hi = _mm256_maddubs_epi16(vecAA_hi,vecSD_hi);

        // shift back to unsigned domain and add 127 for rounding
        vecR_lo = _mm256_add_epi16(vecR_lo, _mm256_set1_epi16(127+128*255));
        vecR_hi = _mm256_add_epi16(vecR_hi, _mm256_set1_epi16(127+128*255));

        // divide (rounding down)
        vecR_lo = div255_epu16(vecR_lo);
        vecR_hi = div255_epu16(vecR_hi);

        // re-join lower and upper half:
        __m256i vecResult = _mm256_packus_epi16(vecR_lo, vecR_hi);
        // Write data back to memory (unaligned)
        _mm256_storeu_si256((__m256i*)d, vecResult);

        d += 32;
        s += 32;
        sa += 32;
        n += 32;
    }

    return n;
}

Godbolt link: https://godbolt.org/z/EYzLw2 Note that -march=haswell (or whichever architecture you want to support) is crucial, because otherwise gcc will not use unaligned data as a memory-source operand. Of course, the general vectorization rules apply, i.e., if you have control over the alignment, prefer allocating your data aligned. And if not, you can peel off the first unaligned bytes (e.g., from d) so that at least one load and the store are aligned; a possible prologue for that is sketched below.
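
An untested sketch of what such a peeling prologue could look like, reusing the scalar formula from your question (with the +127 rounding bias added to match the vectorized loop) and calling the function above; blend_plane_pixels_444 is just an illustrative wrapper name:

#include <cstdint>

int blend_plane_pixels_444(uint8_t *__restrict__ d,
                           uint8_t *__restrict__ s,
                           uint8_t *__restrict__ sa,
                           const int pixels)
{
    int n = 0;
    // Handle component pixels one at a time until d is 32-byte aligned.
    while (n < pixels && (reinterpret_cast<std::uintptr_t>(d) & 31u) != 0)
    {
        const int y = *s * *sa + (255 - *sa) * *d + 127;
        *d = static_cast<uint8_t>(y / 255);
        ++d; ++s; ++sa; ++n;
    }
    // d is now 32-byte aligned (or we ran out of pixels), so the store inside
    // the vectorized loop could be switched to _mm256_store_si256.
    n += blend_plane_pixels_444_vectorized(d, s, sa, pixels - n);
    return n;
}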

Clang will unroll the loop (to two inner iterations) which will slightly improve performance for sufficiently large input.