This question is related to a previously answered question: "Fast 24-bit array -> 32-bit array conversion?" In one answer, interjay kindly posted SSSE3 code for converting RGB24 -> RGB32; however, I also need the reverse conversion (RGB32 -> RGB24). I gave it a shot (see below) and my code definitely works, but it's more complicated than interjay's code, and noticeably slower too. I couldn't see how to exactly reverse the instructions: _mm_alignr_epi8 doesn't seem helpful in this case, but I'm not as familiar with SSSE3 as I should be. Is the asymmetry unavoidable, or is there a faster substitute for the shifts and ORing?
RGB32 -> RGB24:
__m128i *src = ...
__m128i *dst = ...
// Pack the three color bytes of each pixel into the low 12 bytes; -1 zeroes the top 4 bytes.
__m128i mask = _mm_setr_epi8(0,1,2,4, 5,6,8,9, 10,12,13,14, -1,-1,-1,-1);
for (UINT i = 0; i < Pixels; i += 16) {
    // Each shuffled register holds 12 packed bytes (4 pixels) followed by 4 zero bytes.
    __m128i sa = _mm_shuffle_epi8(_mm_load_si128(src), mask);
    __m128i sb = _mm_shuffle_epi8(_mm_load_si128(src + 1), mask);
    __m128i sc = _mm_shuffle_epi8(_mm_load_si128(src + 2), mask);
    __m128i sd = _mm_shuffle_epi8(_mm_load_si128(src + 3), mask);
    // Stitch 4 x 12 bytes into 3 x 16 bytes with byte shifts and ORs.
    _mm_store_si128(dst, _mm_or_si128(sa, _mm_slli_si128(sb, 12)));
    _mm_store_si128(dst + 1, _mm_or_si128(_mm_srli_si128(sb, 4), _mm_slli_si128(sc, 8)));
    _mm_store_si128(dst + 2, _mm_or_si128(_mm_srli_si128(sc, 8), _mm_slli_si128(sd, 4)));
    src += 4;
    dst += 3;
}
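For anyone who wants to try the loop above in isolation, here is a minimal sketch that wraps it in a function and compares it against a plain scalar loop. The function names, the `SSSE3_FN` macro and the test pattern are my own assumptions, not part of the original code. I also swapped in the unaligned `_mm_loadu_si128` / `_mm_storeu_si128` so that ordinary stack buffers work; the original uses the aligned variants, which require 16-byte aligned pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

/* Lets GCC/Clang compile the SSSE3 function without -mssse3 (assumption: x86 target). */
#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Hypothetical wrapper around the loop above; pixels must be a multiple of 16. */
SSSE3_FN static void rgb32_to_rgb24_ssse3(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    const __m128i mask = _mm_setr_epi8(0,1,2,4, 5,6,8,9, 10,12,13,14, -1,-1,-1,-1);
    assert(pixels % 16 == 0);
    for (size_t i = 0; i < pixels; i += 16) {
        const __m128i *s = (const __m128i *)(src + 4 * i); /* 64 input bytes  */
        __m128i *d = (__m128i *)(dst + 3 * i);             /* 48 output bytes */
        __m128i sa = _mm_shuffle_epi8(_mm_loadu_si128(s), mask);
        __m128i sb = _mm_shuffle_epi8(_mm_loadu_si128(s + 1), mask);
        __m128i sc = _mm_shuffle_epi8(_mm_loadu_si128(s + 2), mask);
        __m128i sd = _mm_shuffle_epi8(_mm_loadu_si128(s + 3), mask);
        _mm_storeu_si128(d,     _mm_or_si128(sa, _mm_slli_si128(sb, 12)));
        _mm_storeu_si128(d + 1, _mm_or_si128(_mm_srli_si128(sb, 4), _mm_slli_si128(sc, 8)));
        _mm_storeu_si128(d + 2, _mm_or_si128(_mm_srli_si128(sc, 8), _mm_slli_si128(sd, 4)));
    }
}

/* Scalar reference: copy three bytes per pixel, drop the fourth. */
static void rgb32_to_rgb24_scalar(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; ++i) {
        dst[3 * i + 0] = src[4 * i + 0];
        dst[3 * i + 1] = src[4 * i + 1];
        dst[3 * i + 2] = src[4 * i + 2];
    }
}

/* Returns 1 when both versions agree on a 64-pixel test pattern. */
static int rgb32_to_rgb24_self_check(void)
{
    enum { PIXELS = 64 };
    uint8_t src[PIXELS * 4], fast[PIXELS * 3], ref[PIXELS * 3];
    for (size_t i = 0; i < sizeof src; ++i)
        src[i] = (uint8_t)(i * 7 + 3);
    rgb32_to_rgb24_ssse3(src, fast, PIXELS);
    rgb32_to_rgb24_scalar(src, ref, PIXELS);
    return memcmp(fast, ref, sizeof ref) == 0;
}
```

Calling `rgb32_to_rgb24_self_check()` from `main` is enough to confirm the shift-and-OR stitching lines up with the scalar result; it says nothing about speed.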
RGB24 -> RGB32 (courtesy interjay):
__m128i *src = ...
__m128i *dst = ...
// Spread 12 packed bytes into 4 pixels; -1 zeroes every fourth byte.
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
for (UINT i = 0; i < Pixels; i += 16) {
    __m128i sa = _mm_load_si128(src);
    __m128i sb = _mm_load_si128(src + 1);
    __m128i sc = _mm_load_si128(src + 2);
    __m128i val = _mm_shuffle_epi8(sa, mask);
    _mm_store_si128(dst, val);
    // _mm_alignr_epi8 moves the next 12-byte group, which may span two registers, into the low bytes.
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
    _mm_store_si128(dst + 1, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
    _mm_store_si128(dst + 2, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
    _mm_store_si128(dst + 3, val);
    src += 3;
    dst += 4;
}
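The same kind of check can be sketched for interjay's RGB24 -> RGB32 loop. Again, the wrapper names, the `SSSE3_FN` macro and the test data are assumptions of mine, and I use unaligned loads and stores in place of the aligned ones so that plain stack arrays are acceptable. Note that the mask writes zero into every fourth output byte, so the scalar reference does the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8, _mm_alignr_epi8 */

/* Lets GCC/Clang compile the SSSE3 function without -mssse3 (assumption: x86 target). */
#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Hypothetical wrapper around interjay's loop; pixels must be a multiple of 16. */
SSSE3_FN static void rgb24_to_rgb32_ssse3(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    const __m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
    assert(pixels % 16 == 0);
    for (size_t i = 0; i < pixels; i += 16) {
        const __m128i *s = (const __m128i *)(src + 3 * i); /* 48 input bytes  */
        __m128i *d = (__m128i *)(dst + 4 * i);             /* 64 output bytes */
        __m128i sa = _mm_loadu_si128(s);
        __m128i sb = _mm_loadu_si128(s + 1);
        __m128i sc = _mm_loadu_si128(s + 2);
        /* Each alignr brings the next 12 input bytes down to positions 0..11. */
        _mm_storeu_si128(d,     _mm_shuffle_epi8(sa, mask));
        _mm_storeu_si128(d + 1, _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask));
        _mm_storeu_si128(d + 2, _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask));
        _mm_storeu_si128(d + 3, _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask));
    }
}

/* Scalar reference: copy three bytes per pixel and write zero as the fourth. */
static void rgb24_to_rgb32_scalar(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0;
    }
}

/* Returns 1 when both versions agree on a 64-pixel test pattern. */
static int rgb24_to_rgb32_self_check(void)
{
    enum { PIXELS = 64 };
    uint8_t src[PIXELS * 3], fast[PIXELS * 4], ref[PIXELS * 4];
    for (size_t i = 0; i < sizeof src; ++i)
        src[i] = (uint8_t)(i * 5 + 1);
    rgb24_to_rgb32_ssse3(src, fast, PIXELS);
    rgb24_to_rgb32_scalar(src, ref, PIXELS);
    return memcmp(fast, ref, sizeof ref) == 0;
}
```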
or, as pshufb sets a byte either to zero or the value indexed by the mask. – Gunther Piez
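The comment above notes that pshufb writes zero wherever the mask byte has its high bit set. One way to use that, sketched here as an assumption and not as a measured improvement, is to give each source/destination pair its own mask so the shuffle drops the bytes directly into their final positions and a plain OR merges them, with no byte shifts. This trades the five shifts for two extra shuffles (six shuffles and three ORs per 16 pixels); whether that is actually faster depends on the CPU. All function and mask names below are hypothetical, and unaligned loads and stores are used so that ordinary buffers work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

/* Lets GCC/Clang compile the SSSE3 function without -mssse3 (assumption: x86 target). */
#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Shift-free variant: six shuffles, three ORs; pixels must be a multiple of 16. */
SSSE3_FN static void rgb32_to_rgb24_masks(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    /* Mask names say which packed bytes land where; -1 produces a zero byte. */
    const __m128i a_lo12 = _mm_setr_epi8(0,1,2,4, 5,6,8,9, 10,12,13,14, -1,-1,-1,-1);
    const __m128i b_hi4  = _mm_setr_epi8(-1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, 0,1,2,4);
    const __m128i b_lo8  = _mm_setr_epi8(5,6,8,9, 10,12,13,14, -1,-1,-1,-1, -1,-1,-1,-1);
    const __m128i c_hi8  = _mm_setr_epi8(-1,-1,-1,-1, -1,-1,-1,-1, 0,1,2,4, 5,6,8,9);
    const __m128i c_lo4  = _mm_setr_epi8(10,12,13,14, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1);
    const __m128i d_hi12 = _mm_setr_epi8(-1,-1,-1,-1, 0,1,2,4, 5,6,8,9, 10,12,13,14);
    assert(pixels % 16 == 0);
    for (size_t i = 0; i < pixels; i += 16) {
        const __m128i *s = (const __m128i *)(src + 4 * i);
        __m128i *d = (__m128i *)(dst + 3 * i);
        __m128i a = _mm_loadu_si128(s);
        __m128i b = _mm_loadu_si128(s + 1);
        __m128i c = _mm_loadu_si128(s + 2);
        __m128i e = _mm_loadu_si128(s + 3);
        /* The zeroed lanes of one shuffle are exactly the filled lanes of the other. */
        _mm_storeu_si128(d,     _mm_or_si128(_mm_shuffle_epi8(a, a_lo12), _mm_shuffle_epi8(b, b_hi4)));
        _mm_storeu_si128(d + 1, _mm_or_si128(_mm_shuffle_epi8(b, b_lo8),  _mm_shuffle_epi8(c, c_hi8)));
        _mm_storeu_si128(d + 2, _mm_or_si128(_mm_shuffle_epi8(c, c_lo4),  _mm_shuffle_epi8(e, d_hi12)));
    }
}

/* Returns 1 when the shift-free version matches a scalar loop on 64 pixels. */
static int rgb32_to_rgb24_masks_self_check(void)
{
    enum { PIXELS = 64 };
    uint8_t src[PIXELS * 4], fast[PIXELS * 3], ref[PIXELS * 3];
    for (size_t i = 0; i < sizeof src; ++i)
        src[i] = (uint8_t)(i * 11 + 2);
    rgb32_to_rgb24_masks(src, fast, PIXELS);
    for (size_t i = 0; i < PIXELS; ++i) { /* scalar reference: drop every fourth byte */
        ref[3 * i + 0] = src[4 * i + 0];
        ref[3 * i + 1] = src[4 * i + 1];
        ref[3 * i + 2] = src[4 * i + 2];
    }
    return memcmp(fast, ref, sizeof ref) == 0;
}
```

The six masks are loop-invariant, so they cost registers but no extra work per iteration.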