I need to perform a rotate operation in as few clock cycles as possible.
In the first case, let's assume __m128i as the source and dest type:
source: || A0 || A1 || A2 || A3 ||
dest: || A1 || A2 || A3 || A0 ||
dest = (__m128i)_mm_shuffle_epi32((__m128i)source, _MM_SHUFFLE(0,3,2,1));
Now I want to do the same with AVX intrinsics.
So this time let's assume __m256i as the source and dest type:
source: || A0 || A1 || A2 || A3 || A4 || A5 || A6 || A7 ||
dest: || A1 || A2 || A3 || A4 || A5 || A6 || A7 || A0 ||
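To pin down the intended semantics, here is a minimal scalar sketch of the rotate, assuming 32-bit elements and non-overlapping buffers. The name rotate_left_by_one_ref is hypothetical and only serves as a reference to compare a vectorized version against.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the rotate shown above: dst[i] = src[(i + 1) % n].
   Hypothetical helper, not an intrinsic; src and dst must not overlap. */
static void rotate_left_by_one_ref(const int32_t *src, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[(i + 1) % n];  /* the last slot wraps around to src[0] */
    }
}
```

With n = 8 this reproduces the __m256i layout above; with n = 4 it matches the __m128i case.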
The AVX intrinsics are missing most of the corresponding SSE integer operations. Maybe there is some way to get the desired output using the floating-point versions.
I've tried with:
dest = (__m256i)_mm256_shuffle_ps((__m256)source, (__m256)source, _MM_SHUFFLE(0,3,2,1));
but what I get is:
|| A0 || A2 || A3 || A4 || A5 || A6 || A7 || A1 ||
Any idea how to solve this efficiently (without mixing SSE and AVX operations and without "manually" swapping A0 and A1)?
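One direction that stays within AVX floating-point intrinsics, as an unmeasured sketch: rotate inside each 128-bit lane, swap the two lanes, then blend in the two elements that have to cross the lane boundary. The helper name rotate_left_by_one_avx and the pointer-based interface are assumptions for illustration, and the target attribute assumes GCC or Clang (otherwise drop it and build with -mavx).

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: rotates 8 x 32-bit elements left by one position
   using only AVX (not AVX2) intrinsics. src and dst may be unaligned. */
__attribute__((target("avx")))
static void rotate_left_by_one_avx(const int32_t *src, int32_t *dst)
{
    __m256 x = _mm256_castsi256_ps(
        _mm256_loadu_si256((const __m256i *)src));

    /* Rotate within each 128-bit lane: A1 A2 A3 A0 | A5 A6 A7 A4 */
    __m256 in_lane = _mm256_permute_ps(x, _MM_SHUFFLE(0, 3, 2, 1));

    /* Swap the two lanes:              A5 A6 A7 A4 | A1 A2 A3 A0 */
    __m256 swapped = _mm256_permute2f128_ps(in_lane, in_lane, 0x01);

    /* Take elements 3 and 7 from the swapped copy (mask bits 3 and 7):
                                        A1 A2 A3 A4 | A5 A6 A7 A0 */
    __m256 rotated = _mm256_blend_ps(in_lane, swapped, 0x88);

    _mm256_storeu_si256((__m256i *)dst, _mm256_castps_si256(rotated));
}
```

This costs three shuffle-type instructions and treats the integer data as floats only for data movement, so the bit patterns are left unchanged. Whether it is the fastest option on a given CPU would have to be measured.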
Thanks in advance!
__m256, why are you casting to __m128i? – dario_ramos