3
votes

I was wondering if there is an SSE2/AVX2 integer instruction or sequence of instructions(or intrinsics) to be performed in order to achieve the following result:

Given a row of 8 byte pixels of the form:

A = {a, b, c, d, e, f, g, h}

Is there any way to load these pixels in an YMM register that contains 8 32bit ARGB pixels, such that the initial grayscale value gets broadcast to the other 2 bytes of each corresponding 32 bit pixel? The result should be something like this: ( the 0 is the alpha value )

B = {0aaa, 0bbb, 0ccc, 0ddd, 0eee, 0fff, 0ggg, 0hhh}

I'm a complete beginner in vector extensions so I'm not even sure how to approach this, or if it's at all possible.

Any help would be appreciated. Thanks!

UPDATE1

Thanks for your answers. I still have a problem though:

I put this small example together and compiled with VS2015 on x64.

int main()
{
    unsigned char* pixels = (unsigned char*)_aligned_malloc(64, 32);
    memset(pixels, 0, 64);

    for (unsigned char i = 0; i < 8; i++)
        pixels[i] = 0xaa + i;

    __m128i grayscalePix = _mm_load_si128((const __m128i*)pixels);
    __m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);
    __m256i mulOperand = _mm256_set1_epi32(0x00010101);

    __m256i result = _mm256_mullo_epi32(rgba, mulOperand);

   _aligned_free(pixels);
    return 0;
}

The problem is that after doing

__m256i rgba = mm256_cvtepu8_epi32(grayscalePix)

rgba only has the first four doublewords set. The last four are all 0.

The Intel developer manual says:

VPMOVZXBD ymm1, xmm2/m64
Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1.

I'm not sure if this is intended behaviour or I'm still missing something.

Thanks.

3
Your code looks right. Are you sure you're not just testing it wrong? Or that the compiler didn't optimize some / all of it away because the results are unused? On Godbolt, I had to use -O0 to make the compiler keep the vector ops. Even -Og or -O1 optimized away everything except the malloc/free. Try storing the vector into an uint32_t array and printing it with printf, or something C++ish. - Peter Cordes
The optimizer is not a concern as I was testing this in debug mode but you were still right though :) Apparently, the VS debugger does not display _m256i values correctly. It almost feels like it truncates them at a _m128i boundary. Also, the registers window was not much help either. Everything looks fine after I stored the vector to memory and did a printf, so I guess thanks are in order :) - redeye
Oh wow, things are BAD when you can't trust the debugger! Does it get any better with the debugger when you are using the result? - Peter Cordes
I didn't really bother with the looking at _m256i values in the debugger anymore. When I need to test my code for correctness, I use #ifdef _DEBUG code where I just copy everything to memory and look at it there. - redeye

3 Answers

4
votes

Update: @chtz's answer is an even better idea, using a cheap 128->256 broadcast load instead of vpmovzx to feed vpshufb, saving shuffle-port bandwidth.


Start with PMOVZX like Mark suggests.

But after that, PSHUFB (_mm256_shuffle_epi8) will be much faster than PMULLD, except that it competes for the shuffle port with PMOVZX. (And it operates in-lane, so you still need the PMOVZX).

So if you only care about throughput, not latency, then _mm256_mullo_epi32 is good. But if latency matters, or if your throughput bottlenecks on something other than 2 shuffle-port instructions per vector anyway, then PSHUFB to duplicate the bytes within each pixel should be best.

Actually, even for throughput, _mm256_mullo_epi32 is bad on HSW and BDW: it's 2 uops (10c latency) for p0, so it's 2 uops for one port.

On SKL, it's 2 uops (10c latency) for p01, so it can sustain the same one per clock throughput as VPMOVZXBD. But it's an extra 1 uop, making it more likely to bottleneck.

(VPSHUFB is 1 uop, 1c latency, for port 5, on all Intel CPUs that support AVX2.)

2
votes

You can load the packed bytes into a register, call __m256i _mm256_cvtepu8_epi32 (__m128i a) to convert to 32 bit values, then multiply by 0x00010101 to replicate the gray scale into R,G and B.

1
votes

You can convert 16 pixels with one vbroadcasti128 and two vpshufb. The broadcast does not require port 5, if it directly loads from memory, so the shuffles can fully utilize that port (it will still bottleneck on that port, or on storing back to memory).

void gray2rgba(char const* input, char* output, size_t length)
{
    length &= size_t(-16); // lets just care about sizes multiples of 16 here ...

    __m256i shuflo = _mm256_setr_epi32(
        0x80000000, 0x80010101, 0x80020202, 0x80030303,
        0x80040404, 0x80050505, 0x80060606, 0x80070707
    );
    __m256i shufhi = _mm256_setr_epi32(
        0x80080808, 0x80090909, 0x800a0a0a, 0x800b0b0b,
        0x800c0c0c, 0x800d0d0d, 0x800e0e0e, 0x800f0f0f
    );

    for(size_t i=0; i<length; i+=16)
    {
        __m256i in = _mm256_broadcastsi128_si256(*reinterpret_cast<const __m128i*>(input+i));
        __m256i out0 = _mm256_shuffle_epi8(in, shuflo);
        __m256i out1 = _mm256_shuffle_epi8(in, shufhi);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(output+4*i),    out0);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(output+4*i+32), out1);
    }
}

Godbolt Demo: https://godbolt.org/z/dUx6GZ