4 votes

This question follows up on a previously answered one: Fast 24-bit array -> 32-bit array conversion? In one answer, interjay kindly posted SSSE3 code for converting RGB24 -> RGB32, but I also need the reverse conversion (RGB32 -> RGB24). I gave it a shot (see below) and my code definitely works, but it's more complicated than interjay's, and noticeably slower too. I couldn't see how to exactly reverse the instructions: _mm_alignr_epi8 doesn't seem helpful in this direction, but I'm not as familiar with SSSE3 as I should be. Is the asymmetry unavoidable, or is there a faster substitute for the shifts and ORs?

RGB32 -> RGB24:

__m128i *src = ...
__m128i *dst = ...
__m128i mask = _mm_setr_epi8(0,1,2,4, 5,6,8,9, 10,12,13,14, -1,-1,-1,-1);
for (UINT i = 0; i < Pixels; i += 16) {
    __m128i sa = _mm_shuffle_epi8(_mm_load_si128(src), mask);
    __m128i sb = _mm_shuffle_epi8(_mm_load_si128(src + 1), mask);
    __m128i sc = _mm_shuffle_epi8(_mm_load_si128(src + 2), mask);
    __m128i sd = _mm_shuffle_epi8(_mm_load_si128(src + 3), mask);
    _mm_store_si128(dst, _mm_or_si128(sa, _mm_slli_si128(sb, 12)));
    _mm_store_si128(dst + 1, _mm_or_si128(_mm_srli_si128(sb, 4), _mm_slli_si128(sc, 8)));
    _mm_store_si128(dst + 2, _mm_or_si128(_mm_srli_si128(sc, 8), _mm_slli_si128(sd, 4)));
    src += 4;
    dst += 3;
}

RGB24 -> RGB32 (courtesy interjay):

__m128i *src = ...
__m128i *dst = ...
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
for (UINT i = 0; i < Pixels; i += 16) {
    __m128i sa = _mm_load_si128(src);
    __m128i sb = _mm_load_si128(src + 1);
    __m128i sc = _mm_load_si128(src + 2);
    __m128i val = _mm_shuffle_epi8(sa, mask);
    _mm_store_si128(dst, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
    _mm_store_si128(dst + 1, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
    _mm_store_si128(dst + 2, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
    _mm_store_si128(dst + 3, val);
    src += 3;
    dst += 4;
}
You just need to use 6 masks on 4 input registers to convert them to 3 output registers. You can't get around the three ORs, as pshufb sets each byte either to zero or to the value indexed by the mask. – Gunther Piez

2 Answers

0 votes

You can take this answer and change the shuffle mask to go from RGB32 to RGB24.

The big difference is to compute the shuffles directly and combine them with bitwise operations, avoiding the shifts. Also, using an aligned streaming write instead of a plain aligned write avoids polluting the cache.

0 votes

Old question, but I was trying to solve the same problem so...

You can use palignr if you right-align its second operand, that is, put zeros in its low bytes. You need left-aligned versions of the second, third and fourth 16-byte words, and right-aligned versions of the first, second and third.

For the second and third words, GCC is slightly happier if I use shifts to compute the right-aligned version from the left-aligned one. If I use two different pshufbs instead, it generates three unnecessary moves.

Here is the code. It uses exactly 8 registers; if you're in 64-bit mode you can try unrolling it by two.

    __m128i mask_right = _mm_set_epi8(14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0, 0x80, 0x80, 0x80, 0x80);
    __m128i mask = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0);

    for (; n; n -= 16, d += 48, s += 64) {
            __m128i v0 = _mm_load_si128((__m128i *) &s[0]);
            __m128i v1 = _mm_load_si128((__m128i *) &s[16]);
            __m128i v2 = _mm_load_si128((__m128i *) &s[32]);
            __m128i v3 = _mm_load_si128((__m128i *) &s[48]);

            v0 = _mm_shuffle_epi8(v0, mask_right);
            v1 = _mm_shuffle_epi8(v1, mask);
            v2 = _mm_shuffle_epi8(v2, mask);
            v3 = _mm_shuffle_epi8(v3, mask);

            v0 = _mm_alignr_epi8(v1, v0, 4);
            v1 = _mm_slli_si128(v1, 4);       // mask -> mask_right
            v1 = _mm_alignr_epi8(v2, v1, 8);
            v2 = _mm_slli_si128(v2, 4);       // mask -> mask_right
            v2 = _mm_alignr_epi8(v3, v2, 12);

            _mm_store_si128((__m128i *) &d[0], v0);
            _mm_store_si128((__m128i *) &d[16], v1);
            _mm_store_si128((__m128i *) &d[32], v2);
    }

The central part might also be written like this. The compiler produces one instruction fewer, and it looks like there is a bit more parallelism, but only benchmarking can give the right answer:

            v0 = _mm_shuffle_epi8(v0, mask_right);
            v1 = _mm_shuffle_epi8(v1, mask);
            v2 = _mm_shuffle_epi8(v2, mask_right);
            v3 = _mm_shuffle_epi8(v3, mask);

            __m128i v2l = v2;
            v0 = _mm_alignr_epi8(v1, v0, 4);
            v1 = _mm_slli_si128(v1, 4);             // mask -> mask_right
            v2 = _mm_alignr_epi8(v3, v2, 12);
            v2l = _mm_srli_si128(v2l, 4);           // mask_right -> mask
            v1 = _mm_alignr_epi8(v2l, v1, 8);