AVX; byte multiplication; sum;

Question

I'm optimising the following code with AVX and want to know your opinion about the best approach.

There are two blocks of data, uint8 x[3][3] and uint8 y[3][3]. The result is a uint8 value: the sum of the products of corresponding elements, like

res = (x[0][0]*y[0][0] + x[0][1]*y[0][1] + ... + x[2][2]*y[2][2]) >> NN
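For reference, a plain scalar version of that formula might look like the sketch below. The function name `dot9_scalar` and the `shift` parameter (standing in for NN) are hypothetical, and the 3x3 blocks are assumed to be stored as 9 contiguous bytes. It returns the shifted sum as uint32 and leaves any narrowing to uint8 to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference: sum of the nine byte products, shifted right.
 * x and y each point at 9 contiguous bytes (a 3x3 block, row-major). */
static uint32_t dot9_scalar(const uint8_t *x, const uint8_t *y, unsigned shift)
{
    assert(shift < 32);          /* shifting a 32-bit value by >= 32 is undefined */

    uint32_t sum = 0;
    for (int i = 0; i < 9; ++i) {
        /* Widen before multiplying: each product is at most 255*255 = 65025,
         * and nine of them sum to at most 585225, which needs more than 16 bits. */
        sum += (uint32_t)x[i] * (uint32_t)y[i];
    }
    return sum >> shift;
}
```

A version like this is mainly useful as a baseline to check any vectorised variant against.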

My concerns are:

The result of x[0][0]*y[0][0] is uint16, so before any multiplication I need to unpack uint8 into uint16, which costs extra instructions.
The sum is a uint32 value, so before adding up the multiplication results I need to unpack uint16 into uint32. That is also overhead.

Is there any simpler/faster way to do the same math without extra unpack instructions?

Is there a way to multiply bytes and get a uint32 or uint16 result without extra data conversions?

Thanks.

PS: all elements of x[3][3] and y[3][3] are in the range [0...255]

The first step could be done by (V)PUNPCKLBW, the second by (V)PMADDWD + horizontal add. If you didn't have the requirement of uint16_t intermediates, you might be able to instead use (V)PMADDUBSW for the first step, but the second step would be more complicated. — EOF
PMADDUBSW is multiplication of signed & unsigned bytes, not unsigned & unsigned. Unpacking (PUNPCKLBW) is exactly what I want to avoid. — user3124812
If you know that one of your sources of bytes is signed positive, or unsigned and at most 0x7F, then you can use pmaddubsw. Otherwise not. I forget if AVX-512 has something. Welcome to the joys of SSE/AVX's highly non-orthogonal choice of instructions. It often happens that the perfect operation is available, but not for the element size or signedness you need. You're probably going to need a punpck or pmovzx. — Peter Cordes
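The route suggested in the comments (PUNPCKLBW to widen, PMADDWD to multiply and pair-sum, then a horizontal add) could be sketched with SSE2 intrinsics roughly as follows. This is only an illustration under some assumptions: the name `dot9_sse2` and the `shift` parameter (standing in for NN) are hypothetical, each 3x3 block is assumed to be 9 contiguous bytes, and the ninth element is handled in scalar code rather than in the vector.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical sketch: sum of nine byte products, shifted right.
 * x and y each point at 9 contiguous bytes (a 3x3 block, row-major). */
static uint32_t dot9_sse2(const uint8_t *x, const uint8_t *y, unsigned shift)
{
    assert(shift < 32);
    const __m128i zero = _mm_setzero_si128();

    /* Load the first 8 bytes of each block into the low half of a register. */
    __m128i vx = _mm_loadl_epi64((const __m128i *)x);
    __m128i vy = _mm_loadl_epi64((const __m128i *)y);

    /* PUNPCKLBW against zero: zero-extend 8 bytes to 8 x 16-bit lanes. */
    vx = _mm_unpacklo_epi8(vx, zero);
    vy = _mm_unpacklo_epi8(vy, zero);

    /* PMADDWD: multiply 16-bit lanes as signed values and add adjacent pairs,
     * giving 4 x 32-bit sums. Inputs are in 0..255, so the signed
     * interpretation is harmless and the 16->32 widening comes for free. */
    __m128i sum = _mm_madd_epi16(vx, vy);

    /* Horizontal add of the four 32-bit lanes. */
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));

    uint32_t total = (uint32_t)_mm_cvtsi128_si32(sum);

    /* The ninth element does not fit the 8-byte load; add it in scalar code. */
    total += (uint32_t)x[8] * (uint32_t)y[8];

    return total >> shift;
}
```

With this arrangement only the uint8 to uint16 step needs an explicit unpack; the uint16 to uint32 step is absorbed by PMADDWD. Whether this beats a scalar loop for a single 3x3 block would need measuring.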