I have a C++ (or C-like) function below which I am attempting to vectorize. The function is one of many variations of image compositing: it takes a Y, U, or V image plane with 4:4:4 chroma subsampling and composites/overlays a src image onto a dst image (where the src image also carries an alpha plane for transparency).
#include <cstddef>
#include <cstdint>

void composite(uint8_t *__restrict__ pSrc,  // Source plane
               uint8_t *__restrict__ pSrcA, // Source alpha plane
               uint8_t *__restrict__ pDst,  // Destination plane
               const std::size_t nCount)    // Number of component pixels to process.
{
    for (std::size_t k = 0; k < nCount; ++k)
    {
        uint16_t w = (pSrc[k] * pSrcA[k]);
        uint16_t x = (255 - pSrcA[k]) * pDst[k];
        uint16_t y = w + x;
        uint16_t z = y / uint16_t{255};
        pDst[k] = static_cast<uint8_t>(z);
    }
}
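Note: the vectorized attempt further down adds 127 to w + x before the divide, so it rounds to nearest rather than truncating like the scalar code above. Purely for comparison, the scalar equivalent of that rounding divide would be something like:

uint16_t z = static_cast<uint16_t>((w + x + 127u) / 255u); // round-to-nearest; w + x + 127 still fits in 16 bits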
In the AVX2 vectorized equivalent, I'm struggling to understand how to efficiently read the 8-bit samples, widen them to 16 bits, and (after processing/compositing) narrow the 16-bit samples back to 8 bits for the store to memory. On the read side I'm using an intermediate xmm register, which doesn't seem like the best approach; I'm guessing there will be a performance penalty when mixing register families.
I've come up with (incomplete):
#include <cstdint>
#include <immintrin.h>
#include <emmintrin.h>
///////////////////////////////////////////////////////////////////////////
// Credit: https://stackguides.com/questions/35285324/how-to-divide-16-bit-integer-by-255-with-using-sse
#define AVX2_DIV255_U16(x) _mm256_srli_epi16(_mm256_mulhi_epu16(x, _mm256_set1_epi16((short)0x8081)), 7)
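// (Why the constant works: the macro computes (x * 0x8081) >> 23 - the mulhi
//  supplies the >> 16 and the srli the remaining >> 7. Since 0x8081 * 255 =
//  (1 << 23) + 127, the result equals x / 255 for every 16-bit x. It is a
//  truncating divide; rounding is handled by adding 127 to the numerator first.)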
///////////////////////////////////////////////////////////////////////////
/// Blends/composites/overlays one Y, U, or V plane (4:4:4 chroma subsampling) over another.
/// \param d The destination Y, U, or V component
/// \param s The source Y, U, or V component
/// \param sa The source alpha component
/// \param pixels The number of pixels that require processing.
/// \return The number of pixels processed.
int blend_plane_pixels_444_vectorized(uint8_t *__restrict__ d,
                                      uint8_t *__restrict__ s,
                                      uint8_t *__restrict__ sa,
                                      const int pixels)
{
    int n = 0; // Return number of component pixels processed.
for (int k = 0; k + 32 <= pixels; k += 32)
{
// Load first 16 (unaligned) of d, s, sa
// TODO: This efficient mixing xmm registers with ymm??
auto vecD0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)d));
auto vecS0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)s));
auto vecSa0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)sa));
// Load second 16 (unaligned) of d, s, sa
auto vd1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)d+16));
auto vs1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)s+16));
auto vsa1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i_u *)sa+16));
        // Load 255 into register
        auto vec255 = _mm256_set1_epi16(255);
        // uint16_t w = (pSrc[k] * pSrcA[k]);
        auto vecW0 = _mm256_mullo_epi16(vecS0, vecSa0);
        auto vecW1 = _mm256_mullo_epi16(vs1, vsa1);
        // uint16_t x = (255 - pSrcA[k]) * pDst[k];
        auto vecX0 = _mm256_mullo_epi16(_mm256_subs_epu16(vec255, vecSa0), vecD0);
        auto vecX1 = _mm256_mullo_epi16(_mm256_subs_epu16(vec255, vsa1), vd1);
        // Load 127 into register
        auto vec127 = _mm256_set1_epi16(127);
        // uint16_t y = w+x;
        auto vecY0 = _mm256_adds_epu16(_mm256_adds_epu16(vecW0, vecX0), vec127);
        auto vecY1 = _mm256_adds_epu16(_mm256_adds_epu16(vecW1, vecX1), vec127);
        // uint16_t z = y / uint16_t{255};
        auto vecZ0 = AVX2_DIV255_U16(vecY0);
        auto vecZ1 = AVX2_DIV255_U16(vecY1);
        // TODO: How do I get this back into 8-bit samples so that it can be
        // stored back into the array? (The blendv below is just a placeholder.)
        auto vecResult = _mm256_blendv_epi8(vecZ0, vecZ1, _mm256_set1_epi16(127));
        // Write data back to memory (unaligned)
        _mm256_storeu_si256((__m256i *)d, vecResult);
        d += 32;
        s += 32;
        sa += 32;
        n += 32;
    }
    return n;
}
SIMD is not my forte, and it's something I need to get better at - please be gentle. I imagine there are probably many tweaks I could apply to the current vectorized code (suggestions welcome!).
Development Environment:
- Linux Ubuntu 18.04
- G++ v8.3.0
- C++14
Comments:

- vpackuswb + shuffle to account for lane-crossing. Or unpack lo/hi a pair of 256-bit vectors with _mm256_setzero_si256() in the first place so repacking is just in-lane vpackuswb. – Peter Cordes
- vpmaddubsw, if you can make that work. (It treats one input as signed, the other as unsigned, so it won't work easily for pixels * alpha, except maybe with a range-shift to signed and then adjust? But saturation is a problem so no, I don't think so). – Peter Cordes
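A rough sketch of the first comment, using the names from the code above (vecZ0, vecZ1, s, d; vecRes0/vecRes1 in the second variant are hypothetical stand-ins for the two 16-bit result halves): _mm256_packus_epi16 packs within each 128-bit lane, so with the _mm256_cvtepu8_epi16 loads a _mm256_permute4x64_epi64 is needed afterwards to restore byte order, whereas widening a full 256-bit load by unpacking against _mm256_setzero_si256() keeps both halves in matching lanes and the final pack needs no permute.

// Variant 1: keep the _mm256_cvtepu8_epi16 loads; pack per lane, then fix lane order.
auto vecPacked = _mm256_packus_epi16(vecZ0, vecZ1);                            // bytes 0-7,16-23 | 8-15,24-31
auto vecResult = _mm256_permute4x64_epi64(vecPacked, _MM_SHUFFLE(3, 1, 2, 0)); // back to 0..31 order
_mm256_storeu_si256((__m256i *)d, vecResult);

// Variant 2: load 32 bytes at once and widen by unpacking against zero, so the
// in-lane pack at the end already produces the original byte order (and the
// xmm/ymm mixing goes away entirely).
auto vecZero = _mm256_setzero_si256();
auto vecSrc  = _mm256_loadu_si256((const __m256i_u *)s);   // 32 source bytes
auto vecS0   = _mm256_unpacklo_epi8(vecSrc, vecZero);      // words of bytes 0-7, 16-23
auto vecS1   = _mm256_unpackhi_epi8(vecSrc, vecZero);      // words of bytes 8-15, 24-31
// ... widen sa and d the same way, do the 16-bit math per half as in the question ...
auto vecOut  = _mm256_packus_epi16(vecRes0, vecRes1);      // bytes back in original order
_mm256_storeu_si256((__m256i *)d, vecOut);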