Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 so that it shifts a whole __m256i register by x bytes?
_mm256_slli_si256 seems to just execute two separate _mm_slli_si128 shifts, one on a[127:0] and one on a[255:128].
The left shift should work across the full __m256i, like this:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0]
I saw in another thread that it is possible to build such a shift with _mm256_permutevar8x32_ps for 32-bit elements. But I need a more generic solution that shifts by x bytes. Does anybody already have a solution for this problem?
VPERMD y,y,y, VPERMQ y,y,i, and VPERM2I128 y,y,y,i are all 1 uop, lat=3c, throughput=1/cycle. (And all run on port 5 only on Haswell.) I agree, if you can structure things to work without crossing lanes all the time, that's best. But if your algo inherently benefits from lane crossing, and the extra latency isn't a killer, then it could be a win. – Peter Cordes