I was wondering if there is an SSE2/AVX2 integer instruction, or a sequence of instructions (or intrinsics), that achieves the following result:
Given a row of 8 single-byte grayscale pixels of the form:
A = {a, b, c, d, e, f, g, h}
Is there any way to load these pixels into a YMM register that holds 8 32-bit ARGB pixels, such that the initial grayscale value gets broadcast to the other 2 bytes of each corresponding 32-bit pixel? The result should be something like this (the 0 is the alpha value):
B = {0aaa, 0bbb, 0ccc, 0ddd, 0eee, 0fff, 0ggg, 0hhh}
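To pin down the target layout, here is a minimal scalar sketch of the transformation. It assumes that "0aaa" means the 32-bit value 0x00aaaaaa, with the alpha byte in the top 8 bits. The function name `gray_to_argb_scalar` is made up for illustration. It is only a reference to compare a vector version against, not the vectorized solution itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference (hypothetical helper name): expand each gray byte g
   into the 32-bit value 0x00gggggg. Assumes alpha lives in bits 24..31. */
static void gray_to_argb_scalar(const uint8_t *gray, uint32_t *argb, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t g = gray[i];
        /* Alpha byte stays 0; g fills the three lower bytes. */
        argb[i] = (g << 16) | (g << 8) | g;
    }
}
```

For any g below 256, `(g << 16) | (g << 8) | g` equals `g * 0x00010101`, so a single 32-bit multiply per pixel produces the same value.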
I'm a complete beginner in vector extensions so I'm not even sure how to approach this, or if it's at all possible.
Any help would be appreciated. Thanks!
UPDATE1
Thanks for your answers. I still have a problem though:
I put this small example together and compiled with VS2015 on x64.
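    #include <immintrin.h>  // SSE2/AVX2 intrinsics
    #include <malloc.h>     // _aligned_malloc, _aligned_free (MSVC)
    #include <string.h>     // memset

    int main()
    {
        // 64 bytes, 32-byte aligned
        unsigned char* pixels = (unsigned char*)_aligned_malloc(64, 32);
        memset(pixels, 0, 64);
        for (unsigned char i = 0; i < 8; i++)
            pixels[i] = 0xaa + i;

        // Loads 16 bytes; only the low 8 are used below
        __m128i grayscalePix = _mm_load_si128((const __m128i*)pixels);
        // Zero-extend 8 bytes to 8 doublewords (VPMOVZXBD)
        __m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);
        // x * 0x00010101 copies the low byte into bytes 1 and 2
        __m256i mulOperand = _mm256_set1_epi32(0x00010101);
        __m256i result = _mm256_mullo_epi32(rgba, mulOperand);

        _aligned_free(pixels);
        return 0;
    }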
The problem is that after doing
    __m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);
rgba only has the first four doublewords set. The last four are all 0.
The Intel developer manual says:
VPMOVZXBD ymm1, xmm2/m64
Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1.
I'm not sure if this is the intended behaviour or if I'm still missing something.
Thanks.
[...] -O0 to make the compiler keep the vector ops. Even -Og or -O1 optimized away everything except the malloc/free. Try storing the vector into a uint32_t array and printing it with printf, or something C++ish. - Peter Cordes

[...] __m256i values correctly. It almost feels like it truncates them at a __m128i boundary. Also, the registers window was not much help either. Everything looks fine after I stored the vector to memory and did a printf, so I guess thanks are in order :) - redeye

[...] __m256i values in the debugger anymore. When I need to test my code for correctness, I use #ifdef _DEBUG code where I just copy everything to memory and look at it there. - redeye
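The advice in the comments (store the vector to memory and inspect plain integers there) can be sketched roughly as follows. This is a hedged illustration, not a definitive implementation. The function name `gray8_to_argb8` and the `TARGET_AVX2` macro are made up here. The multiply by 0x00010101 is taken from the question's own code. The target attribute is an assumption that applies to GCC and Clang; with other compilers, enable AVX2 through the compiler options if that is required.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang need AVX2 enabled per function (or via -mavx2). */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: expand 8 grayscale bytes at src into 8 pixels of
   the form 0x00gggggg and write them to dst as ordinary 32-bit integers. */
TARGET_AVX2
static void gray8_to_argb8(const uint8_t *src, uint32_t *dst)
{
    /* Load exactly 8 bytes into the low half of an XMM register. */
    __m128i gray = _mm_loadl_epi64((const __m128i *)src);
    /* VPMOVZXBD: zero-extend each of the 8 bytes to a 32-bit lane. */
    __m256i wide = _mm256_cvtepu8_epi32(gray);
    /* g * 0x00010101 == g | g << 8 | g << 16 for g < 256. */
    __m256i argb = _mm256_mullo_epi32(wide, _mm256_set1_epi32(0x00010101));
    /* Unaligned store, so dst needs no special alignment. */
    _mm256_storeu_si256((__m256i *)dst, argb);
}
```

After calling `gray8_to_argb8(pixels, out)` with a `uint32_t out[8]`, each element of `out` can be printed with `printf` or viewed as a normal array in the debugger. This avoids relying on how the debugger displays `__m256i` values. The code needs a CPU with AVX2 at run time.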