
I am trying to optimize my alpha blending code with SIMD. SSE2, specifically.

At first I was hoping for pure SSE2, but at this point I would settle for SSE4.2 if it's easier. The catch is that using SSE4.2 instead of SSE2 cuts out a significant number of older processors that could otherwise run this code, but at this point I'd accept that compromise.

I am blitting a sprite onto the screen. Everything is in full 32-bit color, ARGB or BGRA, depending on which direction you read it.

I have read every seemingly related question on SO and everything I could find on the web, but I still have not been able to completely wrap my brain around this one specific concept, and I would appreciate some help. I've been at this for days.

Below is my code. This code works, in that it produces the visual effect that I want. A bitmap is drawn onto the background buffer with alpha blending. Everything looks fine and as expected.

But as you will see, even though it works, my code misses the point of SIMD entirely: it operates on each byte one at a time, as if it were completely serialized, and therefore it sees no performance benefit over my more traditional code that operates on just one pixel at a time. With SIMD, I obviously want to work on 4 pixels (128 bits) at a time, in parallel. (I am profiling by measuring frames rendered per second.)
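For reference, this is the per-channel formula my traditional code applies to every pixel, one channel at a time (a sketch; the function name is illustrative):

#include <stdint.h>

// One channel of the blend: out = src*alpha/255 + dst*(255 - alpha)/255
static inline uint8_t BlendChannel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return (uint8_t)(src * alpha / 255 + dst * (255 - alpha) / 255);
}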

I want to run the formula just once per channel, i.e., blend all of the red channels at once, all of the green channels at once, all of the blue channels at once, and all of the alpha channels at once. Or, alternatively, every channel (RGBA) of one pixel at once.

Then I should start to see the full benefit of SIMD.

I feel like I probably need to do some things with masks, but nothing I have tried gets me there.

I would be very grateful for some help.

(This is the inner loop. It only handles 4 pixels. I put this inside a loop where I iterate over 4 pixels at a time with XPixel += 4.)

// Note: _mm_load_si128 expects a 16-byte-aligned __m128i pointer;
// use _mm_loadu_si128 if the buffers are not guaranteed to be aligned.
__m128i BitmapQuadPixel = _mm_load_si128((const __m128i*)((uint32_t*)Bitmap->Memory + BitmapOffset));

__m128i BackgroundQuadPixel = _mm_load_si128((const __m128i*)((uint32_t*)gRenderSurface.Memory + MemoryOffset));

__m128i BlendedQuadPixel;



// 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
// R  G  B  A  R  G  B  A  R  G  B  A  R  G  B  A  


// Red, green and blue components of the first pixel. (Note that the alpha
// bytes, indices 3, 7, 11 and 15, are never written, so the stored alpha
// channels are left indeterminate.)
BlendedQuadPixel.m128i_u8[0]  = BitmapQuadPixel.m128i_u8[0]  * BitmapQuadPixel.m128i_u8[3] / 255 + BackgroundQuadPixel.m128i_u8[0]  * (255 - BitmapQuadPixel.m128i_u8[3]) / 255;
BlendedQuadPixel.m128i_u8[1]  = BitmapQuadPixel.m128i_u8[1]  * BitmapQuadPixel.m128i_u8[3] / 255 + BackgroundQuadPixel.m128i_u8[1]  * (255 - BitmapQuadPixel.m128i_u8[3]) / 255;
BlendedQuadPixel.m128i_u8[2]  = BitmapQuadPixel.m128i_u8[2]  * BitmapQuadPixel.m128i_u8[3] / 255 + BackgroundQuadPixel.m128i_u8[2]  * (255 - BitmapQuadPixel.m128i_u8[3]) / 255;

// Second pixel.
BlendedQuadPixel.m128i_u8[4]  = BitmapQuadPixel.m128i_u8[4]  * BitmapQuadPixel.m128i_u8[7] / 255 + BackgroundQuadPixel.m128i_u8[4]  * (255 - BitmapQuadPixel.m128i_u8[7]) / 255;
BlendedQuadPixel.m128i_u8[5]  = BitmapQuadPixel.m128i_u8[5]  * BitmapQuadPixel.m128i_u8[7] / 255 + BackgroundQuadPixel.m128i_u8[5]  * (255 - BitmapQuadPixel.m128i_u8[7]) / 255;
BlendedQuadPixel.m128i_u8[6]  = BitmapQuadPixel.m128i_u8[6]  * BitmapQuadPixel.m128i_u8[7] / 255 + BackgroundQuadPixel.m128i_u8[6]  * (255 - BitmapQuadPixel.m128i_u8[7]) / 255;

// Third pixel.
BlendedQuadPixel.m128i_u8[8]  = BitmapQuadPixel.m128i_u8[8]  * BitmapQuadPixel.m128i_u8[11] / 255 + BackgroundQuadPixel.m128i_u8[8]  * (255 - BitmapQuadPixel.m128i_u8[11]) / 255;
BlendedQuadPixel.m128i_u8[9]  = BitmapQuadPixel.m128i_u8[9]  * BitmapQuadPixel.m128i_u8[11] / 255 + BackgroundQuadPixel.m128i_u8[9]  * (255 - BitmapQuadPixel.m128i_u8[11]) / 255;
BlendedQuadPixel.m128i_u8[10] = BitmapQuadPixel.m128i_u8[10] * BitmapQuadPixel.m128i_u8[11] / 255 + BackgroundQuadPixel.m128i_u8[10] * (255 - BitmapQuadPixel.m128i_u8[11]) / 255;

// Fourth pixel.
BlendedQuadPixel.m128i_u8[12] = BitmapQuadPixel.m128i_u8[12] * BitmapQuadPixel.m128i_u8[15] / 255 + BackgroundQuadPixel.m128i_u8[12] * (255 - BitmapQuadPixel.m128i_u8[15]) / 255;
BlendedQuadPixel.m128i_u8[13] = BitmapQuadPixel.m128i_u8[13] * BitmapQuadPixel.m128i_u8[15] / 255 + BackgroundQuadPixel.m128i_u8[13] * (255 - BitmapQuadPixel.m128i_u8[15]) / 255;
BlendedQuadPixel.m128i_u8[14] = BitmapQuadPixel.m128i_u8[14] * BitmapQuadPixel.m128i_u8[15] / 255 + BackgroundQuadPixel.m128i_u8[14] * (255 - BitmapQuadPixel.m128i_u8[15]) / 255;

_mm_store_si128((__m128i*)((uint32_t*)gRenderSurface.Memory + MemoryOffset), BlendedQuadPixel);
Comments:

Probably unpack bytes to 16-bit elements and use a packed multiply. Blend with _mm_blend_epi16, or pack back down to bytes and blend with _mm_blendv_epi8 or AND/ANDN/OR. It's OK if you compute a useless result; just blend with a vector that has the right result. – Peter Cordes

See SSE alpha blending for pre-multiplied ARGB. Also, How to alpha blend RGBA unsigned byte color fast? has some suggestions, but doesn't show how to mask components. Faster alpha blending using a lookup table? has an SSE2 answer that doesn't use a LUT, doing the blending with AND/ANDNOT/OR. – Peter Cordes
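The AND/ANDN/OR blend the comments mention is the classic SSE2 bitwise select; a minimal sketch (the function name is illustrative):

#include <emmintrin.h>  // SSE2

// For each bit, take a where mask is 1 and b where mask is 0. This is the
// usual way to "branch" in SIMD: compute both candidate results in full,
// then select between them with a mask.
static inline __m128i select_si128(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}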

1 Answer


Seeing gRenderSurface, I wonder whether you shouldn't just blend the images on the GPU, e.g., with a GLSL shader; if not, be aware that reading memory back from the render surface can be very slow. Anyway, here's my cup of tea using SSE4.1, since none of the links in the comments does quite the same thing.

This one shuffles the alpha bytes into all of the color-channel positions using the _aa mask, and performs the "one minus source alpha" blending with the final masking step. With AVX2 it outperforms the scalar implementation by a factor of ~5.7x, while the SSE4.1 version, which processes the low and high quadwords separately, is ~3.14x faster than the scalar implementation (both measured using Intel Compiler 19.0).

The division by 255 comes from How to divide 16-bit integer by 255 with using SSE?
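In scalar terms, that trick computes an exact floor(x/255) for any 16-bit x using only shifts and adds; a sketch of what the vector code below does per 16-bit lane:

#include <stdint.h>

// Exact x/255 for 0 <= x <= 65535; the vector loop below implements this
// with _mm_srli_epi16 and _mm_adds_epu16.
static inline uint16_t div255(uint16_t x)
{
    return (uint16_t)((x + 1u + (x >> 8)) >> 8);
}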

const __m128i _aa = _mm_set_epi8( 15,15,15,15, 11,11,11,11, 7,7,7,7, 3,3,3,3 ); // broadcasts each pixel's alpha byte to all 4 of its bytes
const __m128i _mask1 = _mm_set_epi16(-1,0,0,0, -1,0,0,0);     // selects the 16-bit alpha lanes
const __m128i _mask2 = _mm_set_epi16(0,-1,-1,-1, 0,-1,-1,-1); // selects the 16-bit color lanes
const __m128i _v255 = _mm_set1_epi8( -1 );                    // 0xFF in every byte
const __m128i _v1 = _mm_set1_epi16( 1 );

const int xmax = 4*source.cols-15; // loop bound: ensures a full 16-byte load fits inside the row
for ( int y=0;y<source.rows;++y )
{
    // OpenCV CV_8UC4 input
    const unsigned char * pS = source.ptr<unsigned char>( y );
    const unsigned char * pD = dest.ptr<unsigned char>( y );
    unsigned char *pOut = out.ptr<unsigned char>( y );
    for ( int x=0;x<xmax;x+=16 )
    {
        __m128i _src = _mm_loadu_si128( (__m128i*)( pS+x ) );
        __m128i _src_a = _mm_shuffle_epi8( _src, _aa );

        __m128i _dst = _mm_loadu_si128( (__m128i*)( pD+x ) );
        __m128i _dst_a = _mm_shuffle_epi8( _dst, _aa );
        __m128i _one_minus_src_a = _mm_subs_epu8( _v255, _src_a );

        // Widen the low 8 bytes (two pixels) of each vector to 16-bit lanes.
        __m128i _s_a = _mm_cvtepu8_epi16( _src_a );
        __m128i _s = _mm_cvtepu8_epi16( _src );
        __m128i _d = _mm_cvtepu8_epi16( _dst );
        __m128i _d_a = _mm_cvtepu8_epi16( _one_minus_src_a );
        // src*a + dst*(255-a), then the exact division by 255.
        __m128i _out = _mm_adds_epu16( _mm_mullo_epi16( _s, _s_a ), _mm_mullo_epi16( _d, _d_a ) );
        _out = _mm_srli_epi16( _mm_adds_epu16( _mm_adds_epu16( _v1, _out ), _mm_srli_epi16( _out, 8 ) ), 8 );
        // Keep the blended colors (_mask2); for the alpha lanes (_mask1),
        // store saturated src_a + dst_a instead.
        _out = _mm_or_si128( _mm_and_si128(_out,_mask2), _mm_and_si128( _mm_adds_epu16(_s_a, _mm_cvtepu8_epi16(_dst_a)),_mask1) );

        __m128i _out2;
        // Compute _out2 the same way using the high quadwords of _src and
        // _dst: shift each input right by 8 bytes before widening.
        __m128i _s_a2 = _mm_cvtepu8_epi16( _mm_srli_si128( _src_a, 8 ) );
        __m128i _s2 = _mm_cvtepu8_epi16( _mm_srli_si128( _src, 8 ) );
        __m128i _d2 = _mm_cvtepu8_epi16( _mm_srli_si128( _dst, 8 ) );
        __m128i _d_a2 = _mm_cvtepu8_epi16( _mm_srli_si128( _one_minus_src_a, 8 ) );
        _out2 = _mm_adds_epu16( _mm_mullo_epi16( _s2, _s_a2 ), _mm_mullo_epi16( _d2, _d_a2 ) );
        _out2 = _mm_srli_epi16( _mm_adds_epu16( _mm_adds_epu16( _v1, _out2 ), _mm_srli_epi16( _out2, 8 ) ), 8 );
        _out2 = _mm_or_si128( _mm_and_si128(_out2,_mask2), _mm_and_si128( _mm_adds_epu16(_s_a2, _mm_cvtepu8_epi16( _mm_srli_si128( _dst_a, 8 ) )),_mask1) );
        __m128i _ret = _mm_packus_epi16( _out, _out2 );
        _mm_storeu_si128( (__m128i*)(pOut+x), _ret );
    }
}
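Note that the loop bound xmax stops the vector loop at the last full 16-byte group, so when the row width isn't a multiple of 4 pixels the leftover pixels still need a scalar pass. A minimal sketch, to be placed inside the row loop after the vector loop, assuming x is declared before the inner for (int x = 0; for ( ; x<xmax; x+=16 )) so its final value survives; this part is not in the original answer:

        // Scalar tail: same formula and same alpha convention as the
        // _mask1/_mask2 masking (colors blended, alpha = saturated src_a + dst_a).
        for ( ; x < 4*source.cols; x += 4 )
        {
            const unsigned a = pS[x+3];
            for ( int c = 0; c < 3; ++c )
            {
                const unsigned v = pS[x+c]*a + pD[x+c]*(255u-a);
                pOut[x+c] = (unsigned char)((v + 1 + (v >> 8)) >> 8);
            }
            const unsigned sum_a = pS[x+3] + (unsigned)pD[x+3];
            pOut[x+3] = (unsigned char)(sum_a > 255 ? 255 : sum_a);
        }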