I am trying to optimize my alpha blending code with SIMD, specifically SSE2.
I would prefer to stay with SSE2, because requiring SSE4.2 cuts out a significant number of older processors that could otherwise run this code. At this point, though, I would accept that compromise and settle for SSE4.2 if it makes this easier.
I am blitting a sprite onto the screen. Everything is in full 32-bit color, ARGB or BGRA, depending on which direction you read it.
I have read every other seemingly related question on SO and everything I could find on the web but I still have not been able to completely wrap my brain around this one specific concept, and I would appreciate some help. I've been at this for days.
Below is my code. This code works, in that it produces the visual effect that I want. A bitmap is drawn onto the background buffer with alpha blending. Everything looks fine and as expected.
But you will see that even though it works, my code misses the point of SIMD entirely. It operates on each byte one at a time, just as if it were completely serialized, and therefore sees no performance benefit over my more traditional code that operates on one pixel at a time. With SIMD, I obviously want to work on a full 128-bit register at a time, in parallel: 4 pixels, or at least every channel of one pixel. (I am profiling by measuring frames rendered per second.)
I want to just run the formula once for each channel, i.e., blend all of the red channel at once, all of the green channel at once, all of the blue channel at once, and all of the alpha channel at once. Or alternatively, every channel (RGBA) of one of the pixels at once.
Then I should start to see the full benefit of SIMD.
I feel like I probably need to do some things with masks, but nothing I have tried gets me there.
I would be very grateful for some help.
(This is the inner loop. It only handles 4 pixels. I put this inside of a loop where I iterate over 4 pixels at a time with XPixel+=4.)
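// _mm_load_si128 takes a __m128i pointer, so do the pixel arithmetic on uint32_t* first and then cast.
__m128i BitmapQuadPixel = _mm_load_si128((const __m128i*)((uint32_t*)Bitmap->Memory + BitmapOffset));
__m128i BackgroundQuadPixel = _mm_load_si128((const __m128i*)((uint32_t*)gRenderSurface.Memory + MemoryOffset));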
__m128i BlendedQuadPixel;
// 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
// R  G  B  A  R  G  B  A  R  G  B  A  R  G  B  A
// This is the red component of the first pixel.
BlendedQuadPixel.m128i_u8[0] = BitmapQuadPixel.m128i_u8[0] * BitmapQuadPixel.m128i_u8[3] / 255 + BackgroundQuadPixel.m128i_u8[0] * (255 - BitmapQuadPixel.m128i_u8[3]) / 255;
// This is the green component of the first pixel.
BlendedQuadPixel.m128i_u8[1] = BitmapQuadPixel.m128i_u8[1] * BitmapQuadPixel.m128i_u8[3] / 255 + BackgroundQuadPixel.m128i_u8[1] * (255 - BitmapQuadPixel.m128i_u8[3]) / 255;
// And so on...
BlendedQuadPixel.m128i_u8[2] = BitmapQuadPixel.m128i_u8[2] * BitmapQuadPixel.m128i_u8[3] / 255 + BackgroundQuadPixel.m128i_u8[2] * (255 - BitmapQuadPixel.m128i_u8[3]) / 255;
BlendedQuadPixel.m128i_u8[4] = BitmapQuadPixel.m128i_u8[4] * BitmapQuadPixel.m128i_u8[7] / 255 + BackgroundQuadPixel.m128i_u8[4] * (255 - BitmapQuadPixel.m128i_u8[7]) / 255;
BlendedQuadPixel.m128i_u8[5] = BitmapQuadPixel.m128i_u8[5] * BitmapQuadPixel.m128i_u8[7] / 255 + BackgroundQuadPixel.m128i_u8[5] * (255 - BitmapQuadPixel.m128i_u8[7]) / 255;
BlendedQuadPixel.m128i_u8[6] = BitmapQuadPixel.m128i_u8[6] * BitmapQuadPixel.m128i_u8[7] / 255 + BackgroundQuadPixel.m128i_u8[6] * (255 - BitmapQuadPixel.m128i_u8[7]) / 255;
BlendedQuadPixel.m128i_u8[8] = BitmapQuadPixel.m128i_u8[8] * BitmapQuadPixel.m128i_u8[11] / 255 + BackgroundQuadPixel.m128i_u8[8] * (255 - BitmapQuadPixel.m128i_u8[11]) / 255;
BlendedQuadPixel.m128i_u8[9] = BitmapQuadPixel.m128i_u8[9] * BitmapQuadPixel.m128i_u8[11] / 255 + BackgroundQuadPixel.m128i_u8[9] * (255 - BitmapQuadPixel.m128i_u8[11]) / 255;
BlendedQuadPixel.m128i_u8[10] = BitmapQuadPixel.m128i_u8[10] * BitmapQuadPixel.m128i_u8[11] / 255 + BackgroundQuadPixel.m128i_u8[10] * (255 - BitmapQuadPixel.m128i_u8[11]) / 255;
BlendedQuadPixel.m128i_u8[12] = BitmapQuadPixel.m128i_u8[12] * BitmapQuadPixel.m128i_u8[15] / 255 + BackgroundQuadPixel.m128i_u8[12] * (255 - BitmapQuadPixel.m128i_u8[15]) / 255;
BlendedQuadPixel.m128i_u8[13] = BitmapQuadPixel.m128i_u8[13] * BitmapQuadPixel.m128i_u8[15] / 255 + BackgroundQuadPixel.m128i_u8[13] * (255 - BitmapQuadPixel.m128i_u8[15]) / 255;
BlendedQuadPixel.m128i_u8[14] = BitmapQuadPixel.m128i_u8[14] * BitmapQuadPixel.m128i_u8[15] / 255 + BackgroundQuadPixel.m128i_u8[14] * (255 - BitmapQuadPixel.m128i_u8[15]) / 255;
_mm_store_si128((__m128i*)((uint32_t*)gRenderSurface.Memory + MemoryOffset), BlendedQuadPixel);
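The twelve per-byte assignments above all follow one pattern, which can be sketched as a plain scalar loop. This is only an illustration: the name BlendPixelsScalar is hypothetical, the alpha channel is assumed to sit in byte 3 of each pixel as in the byte-layout comment, and leaving the destination alpha untouched is an assumption, since the code above never assigns those bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: blends src over dst in place.
   Each pixel is 4 bytes; byte 3 is assumed to be alpha. */
static void BlendPixelsScalar(const uint8_t *src, uint8_t *dst, size_t byteCount)
{
    assert(byteCount % 4 == 0); /* whole pixels only */
    for (size_t p = 0; p < byteCount; p += 4) {
        unsigned a = src[p + 3]; /* source alpha of this pixel */
        for (size_t c = 0; c < 3; ++c) {
            /* Same formula as above: two separate truncating divisions. */
            dst[p + c] = (uint8_t)(src[p + c] * a / 255 + dst[p + c] * (255 - a) / 255);
        }
        /* Assumption: dst[p + 3] (destination alpha) is left unchanged. */
    }
}
```

A loop like this can serve as a baseline for checking that a vectorized version produces identical bytes.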
_mm_blend_epi16, or pack back to bytes and blend with _mm_blendv_epi8, or AND/ANDN/OR. It's ok if you compute a useless result, just blend with a vector that has the right result. – Peter Cordes
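One way the AND/ANDN/OR suggestion could look with SSE2 only is sketched below. It is not a definitive implementation. The names DivideBy255, BlendTwoPixels16 and BlendRowSSE2 are hypothetical. The sketch assumes alpha is byte 3 of each pixel, that the destination alpha should be kept, and that the pixel count is a multiple of 4. It widens bytes to 16-bit lanes, runs the question's formula on every lane (including the alpha lane, whose result is useless), and then masks the destination alpha back in. Unaligned loads are used so the sketch does not depend on buffer alignment.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Truncating x / 255 on eight unsigned 16-bit lanes:
   floor(x / 255) == (x * 0x8081) >> 23 for any 16-bit x. */
static __m128i DivideBy255(__m128i x)
{
    return _mm_srli_epi16(_mm_mulhi_epu16(x, _mm_set1_epi16((short)0x8081)), 7);
}

/* Blends two pixels widened to 16-bit lanes: src*a/255 + dst*(255-a)/255. */
static __m128i BlendTwoPixels16(__m128i src16, __m128i dst16)
{
    /* Broadcast each pixel's alpha (lane 3 and lane 7) across its four lanes. */
    __m128i a = _mm_shufflelo_epi16(src16, _MM_SHUFFLE(3, 3, 3, 3));
    a = _mm_shufflehi_epi16(a, _MM_SHUFFLE(3, 3, 3, 3));
    __m128i inv = _mm_sub_epi16(_mm_set1_epi16(255), a);
    __m128i s = DivideBy255(_mm_mullo_epi16(src16, a));
    __m128i d = DivideBy255(_mm_mullo_epi16(dst16, inv));
    return _mm_add_epi16(s, d);
}

/* Blends `count` source pixels over `dst` in place, 4 pixels per iteration. */
static void BlendRowSSE2(const uint32_t *src, uint32_t *dst, size_t count)
{
    assert(count % 4 == 0); /* assumption: no scalar tail handling here */
    const __m128i zero = _mm_setzero_si128();
    const __m128i alphaMask = _mm_set1_epi32((int)0xFF000000u); /* byte 3 of each pixel */
    for (size_t i = 0; i < count; i += 4) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        /* Widen bytes to 16 bits so the products fit, two pixels per register. */
        __m128i lo = BlendTwoPixels16(_mm_unpacklo_epi8(s, zero), _mm_unpacklo_epi8(d, zero));
        __m128i hi = BlendTwoPixels16(_mm_unpackhi_epi8(s, zero), _mm_unpackhi_epi8(d, zero));
        __m128i blended = _mm_packus_epi16(lo, hi); /* back to 16 bytes */
        /* The alpha bytes hold a useless result; replace them with dst alpha. */
        blended = _mm_or_si128(_mm_andnot_si128(alphaMask, blended),
                               _mm_and_si128(alphaMask, d));
        _mm_storeu_si128((__m128i *)(dst + i), blended);
    }
}
```

Computing both terms separately before adding keeps the rounding identical to the per-byte formula in the question. If SSE4.1 is acceptable, the final three-instruction mask step could instead be a single _mm_blendv_epi8.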