I'm optimising the following code with AVX and want to know your opinion about the best approach.
There are two blocks of data:
uint8 x[3][3];
uint8 y[3][3];
The result is a uint8 value, which is the sum of the products of corresponding elements:
res = (x[0][0]*y[0][0] + x[0][1]*y[0][1] + ... + x[2][2]*y[2][2]) >> NN
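For reference, the computation described above can be written in scalar form. This is only a minimal sketch: the function name `dot3x3_scalar`, the `nn` parameter (standing in for the unspecified NN), and passing each block as a pointer to its 9 contiguous bytes are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference (illustrative name): sum of the 9 element-wise products
 * of two 3x3 byte blocks, shifted right by nn.
 * x and y point at the first element of each block, e.g. &x[0][0]. */
static uint8_t dot3x3_scalar(const uint8_t *x, const uint8_t *y, unsigned nn)
{
    assert(nn < 32);              /* shifting a 32-bit value by >= 32 is undefined */

    uint32_t sum = 0;             /* worst case 9 * 255 * 255 = 585225, needs 20 bits */
    for (int i = 0; i < 9; ++i)
        sum += (uint32_t)x[i] * y[i];

    return (uint8_t)(sum >> nn);  /* truncates if the shifted sum exceeds 255 */
}
```

A vectorised version can be checked against this function for the same inputs and shift.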
My concerns are:
The result of x[0][0]*y[0][0] is uint16, so before any multiplication I need to unpack uint8 into uint16, which costs extra instructions.
The sum is a uint32 value, so before merging the multiplication results I need to unpack uint16 into uint32. That is also overhead.
Is there any simpler/faster way to do the same math without the extra unpack instructions?
Is there a way to multiply bytes and get a uint32 or uint16 result without extra data conversions?
Thanks.
PS: all elements of x[3][3] and y[3][3] are in the range [0...255].
(V)PUNPCKLBW, the second by (V)PMADDWD + horizontal add. If you didn't have the requirement of uint16_t intermediaries, you might be able to use (V)PMADDUBSW for the first step instead, but the second step would be more complicated. - EOF
pmaddubsw. Otherwise not. I forget if AVX-512 has something. Welcome to the joys of SSE/AVX's highly non-orthogonal choice of instructions. It often happens that the perfect operation is available, but not for the element size or signedness you need. You're probably going to need a punpck or pmovzx. - Peter Cordes