4 votes

This question follows up on a previously answered one: Fast 24-bit array -> 32-bit array conversion? In one answer, interjay kindly posted SSSE3 code for converting RGB24 -> RGB32, but I also need the reverse conversion (RGB32 -> RGB24). I gave it a shot (see below) and my code definitely works, but it's more complicated than interjay's, and noticeably slower too. I couldn't see how to exactly reverse the instructions: _mm_alignr_epi8 doesn't seem helpful in this direction, but I'm not as familiar with SSSE3 as I should be. Is the asymmetry unavoidable, or is there a faster substitute for the shifts and ORs?

RGB32 -> RGB24:

__m128i *src = ...
__m128i *dst = ...
__m128i mask = _mm_setr_epi8(0,1,2,4, 5,6,8,9, 10,12,13,14, -1,-1,-1,-1);
for (UINT i = 0; i < Pixels; i += 16) {
    __m128i sa = _mm_shuffle_epi8(_mm_load_si128(src), mask);
    __m128i sb = _mm_shuffle_epi8(_mm_load_si128(src + 1), mask);
    __m128i sc = _mm_shuffle_epi8(_mm_load_si128(src + 2), mask);
    __m128i sd = _mm_shuffle_epi8(_mm_load_si128(src + 3), mask);
    _mm_store_si128(dst, _mm_or_si128(sa, _mm_slli_si128(sb, 12)));
    _mm_store_si128(dst + 1, _mm_or_si128(_mm_srli_si128(sb, 4), _mm_slli_si128(sc, 8)));
    _mm_store_si128(dst + 2, _mm_or_si128(_mm_srli_si128(sc, 8), _mm_slli_si128(sd, 4)));
    src += 4;
    dst += 3;
}

RGB24 -> RGB32 (courtesy interjay):

__m128i *src = ...
__m128i *dst = ...
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
for (UINT i = 0; i < Pixels; i += 16) {
    __m128i sa = _mm_load_si128(src);
    __m128i sb = _mm_load_si128(src + 1);
    __m128i sc = _mm_load_si128(src + 2);
    __m128i val = _mm_shuffle_epi8(sa, mask);
    _mm_store_si128(dst, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
    _mm_store_si128(dst + 1, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
    _mm_store_si128(dst + 2, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
    _mm_store_si128(dst + 3, val);
    src += 3;
    dst += 4;
}
You just need to use 6 masks on 4 input registers to convert them to 3 output registers. You can't get around the three ORs, as pshufb sets each byte either to zero or to the value indexed by the mask. – Gunther Piez

2 Answers

0 votes

You can take this answer and change the shuffle mask to go from RGB32 to RGB24.

The big difference is to compute the shuffles directly and combine them with bitwise operations, avoiding the shifts. Also, using an aligned streaming write instead of a plain aligned write avoids polluting the cache.

0 votes

Old question, but I was trying to solve the same problem so...

You can use palignr if you right-align its second operand, that is, put zeros in its low bytes. You need left-aligned versions of the second, third and fourth 16-byte words, and right-aligned versions of the first, second and third.

For the second and third words, GCC is slightly happier if I use shifts to compute the right-aligned version from the left-aligned one. If I use two different pshufbs instead, it generates three unnecessary moves.

Here is the code. It uses exactly 8 registers; if you're in 64-bit mode you can try unrolling it by two.

    __m128i mask_right = _mm_set_epi8(14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0, 0x80, 0x80, 0x80, 0x80);
    __m128i mask = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0);

    for (; n; n -= 16, d += 48, s += 64) {
            __m128i v0 = _mm_load_si128((__m128i *) &s[0]);
            __m128i v1 = _mm_load_si128((__m128i *) &s[16]);
            __m128i v2 = _mm_load_si128((__m128i *) &s[32]);
            __m128i v3 = _mm_load_si128((__m128i *) &s[48]);

            v0 = _mm_shuffle_epi8(v0, mask_right);
            v1 = _mm_shuffle_epi8(v1, mask);
            v2 = _mm_shuffle_epi8(v2, mask);
            v3 = _mm_shuffle_epi8(v3, mask);

            v0 = _mm_alignr_epi8(v1, v0, 4);
            v1 = _mm_slli_si128(v1, 4);       // mask -> mask_right
            v1 = _mm_alignr_epi8(v2, v1, 8);
            v2 = _mm_slli_si128(v2, 4);       // mask -> mask_right
            v2 = _mm_alignr_epi8(v3, v2, 12);

            _mm_store_si128((__m128i *) &d[0], v0);
            _mm_store_si128((__m128i *) &d[16], v1);
            _mm_store_si128((__m128i *) &d[32], v2);
    }

The central part might also be written like this. The compiler produces one instruction fewer, and it looks like there is a bit more parallelism, but only benchmarking can give the right answer:

            v0 = _mm_shuffle_epi8(v0, mask_right);
            v1 = _mm_shuffle_epi8(v1, mask);
            v2 = _mm_shuffle_epi8(v2, mask_right);
            v3 = _mm_shuffle_epi8(v3, mask);

            __m128i v2l = v2;
            v0 = _mm_alignr_epi8(v1, v0, 4);
            v1 = _mm_slli_si128(v1, 4);             // mask -> mask_right
            v2 = _mm_alignr_epi8(v3, v2, 12);
            v2l = _mm_srli_si128(v2l, 4);           // mask_right -> mask
            v1 = _mm_alignr_epi8(v2l, v1, 8);