After a few operations I got the following three intermediate vectors.
__m256 Vec1 = [a0 a1 a2 a3 a4 a5 a6 a7]; //8 float values
__m256 Vec2 = [b0 b1 b2 b3 b4 b5 b6 b7]; //8 float values
__m256 Vec3 = [c0 c1 c2 c3 c4 c5 c6 c7]; //8 float values
I need to rearrange these vectors as shown below for further processing.
__m256 ReVec1 = [a0 a1 b0 b1 c0 c1 a2 a3];
__m256 ReVec2 = [b2 b3 c2 c3 a4 a5 b4 b5];
__m256 ReVec3 = [c4 c5 a6 a7 b6 b7 c6 c7];
How can I shuffle three Vectors in AVX?
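Read as pairs of floats, the requested layout interleaves the three inputs pair by pair (`a0a1 b0b1 c0c1 a2a3 b2b3 c2c3 ...`). A minimal scalar sketch of that mapping, which can serve as a reference when testing any SIMD version, might look like the following. The function name `rearrange_ref` and the back-to-back storage of the three vectors are assumptions made for illustration, not part of the question.

```c
#include <stddef.h>

/* Scalar reference for the requested layout (hypothetical helper).
 * in:  Vec1, Vec2, Vec3 stored back to back (3 x 8 floats).
 * out: ReVec1, ReVec2, ReVec3 stored back to back (3 x 8 floats).
 * Output pair j (two floats) is pair j/3 of input vector j%3. */
static void rearrange_ref(const float in[24], float out[24])
{
    for (size_t j = 0; j < 12; ++j) {
        size_t vec  = j % 3;  /* 0 = a, 1 = b, 2 = c */
        size_t pair = j / 3;  /* which 2-float pair inside that vector */
        out[2 * j]     = in[8 * vec + 2 * pair];
        out[2 * j + 1] = in[8 * vec + 2 * pair + 1];
    }
}
```

Filling `in` with distinct values per element and comparing against the `ReVec1`..`ReVec3` rows in the question is a quick way to check the index arithmetic.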
[...] `vshufpd` as a building block, which is a 2-input in-lane shuffle of 64-bit chunks, allowing a different immediate control for the high and low lanes (but with restrictions on what can come from where). Or possibly even `vpalignr` to do in-lane shifting in of bytes? Immediate blends are efficient; if you can blend the new elements into vectors, then you can put things in the right order with `vpermpd` (lane crossing). - Peter Cordes

`b`'s elements could be achieved by an in-lane shuffle to swap 64-bit halves of each 128-bit lane, ready to blend into each of the 3 outputs. `a` and `c`'s elements could get where they're needed with one `vpermpd` lane-crossing shuffle each. Doing better than 3 shuffles + 6 blends would require finding ways to use 2-input shuffles, I think. - Peter Cordes

[...] `a0a1`, `a2a3` is one element each, this is like a 3x4 transpose. For 3x8 transpose, Intel published an article once, which probably is adaptable to 3x4: software.intel.com/content/www/us/en/develop/articles/…. You may be more efficient if either the input or the output is actually from/to memory (no need for lane-crossing shuffles). - chtz

[...] `a` and `c` elements together, which needs to be shuffled between lanes (to get `[c4c5 a6a7 c0c1 a2a3]`). And it is possible to replace 2 blends + 1 shuffle by two 2-input shuffles (`[a0a1 b0b1 a4a5 b4b5]` and `[b2b3 c2c3 b6b7 c6c7]`). The rest is just 3 blends. - chtz
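The two approaches sketched in the comments could be written with intrinsics roughly as follows. This is one reading of those comments, not a confirmed implementation from either commenter: the names `shuffle3_v1`, `shuffle3_v2`, `shuffle3_selftest` and the `SHUF3_TARGET` macro are made up here, the immediates were derived by hand from the target layout, and using `vperm2f128` for the lane swap in the second variant is an assumption (the comment only says the blended vector must be shuffled between lanes). Vectors are treated as four 64-bit elements each, e.g. `a = [A0 A1 A2 A3]` with `A0 = a0a1`.

```c
#include <immintrin.h>

/* GCC/Clang: enable AVX2 per function so no -mavx2 flag is needed. */
#if defined(__GNUC__)
#define SHUF3_TARGET __attribute__((target("avx2")))
#else
#define SHUF3_TARGET
#endif

/* Variant 1: 3 one-input shuffles + 6 blends (vpermpd needs AVX2). */
SHUF3_TARGET
static inline void shuffle3_v1(__m256 a, __m256 b, __m256 c, __m256 out[3])
{
    __m256d ap = _mm256_permute4x64_pd(_mm256_castps_pd(a), 0x6C); /* [A0 A3 A2 A1] */
    __m256d bs = _mm256_permute_pd(_mm256_castps_pd(b), 0x5);      /* [B1 B0 B3 B2] */
    __m256d cp = _mm256_permute4x64_pd(_mm256_castps_pd(c), 0xC6); /* [C2 C1 C0 C3] */

    /* Blend mask bit i set: take 64-bit element i from the second operand. */
    __m256d r1 = _mm256_blend_pd(_mm256_blend_pd(ap, bs, 0x2), cp, 0x4); /* [A0 B0 C0 A1] */
    __m256d r2 = _mm256_blend_pd(_mm256_blend_pd(bs, cp, 0x2), ap, 0x4); /* [B1 C1 A2 B2] */
    __m256d r3 = _mm256_blend_pd(_mm256_blend_pd(cp, ap, 0x2), bs, 0x4); /* [C2 A3 B3 C3] */

    out[0] = _mm256_castpd_ps(r1);
    out[1] = _mm256_castpd_ps(r2);
    out[2] = _mm256_castpd_ps(r3);
}

/* Variant 2: two 2-input in-lane shuffles, one blend + lane swap, 3 blends. */
SHUF3_TARGET
static inline void shuffle3_v2(__m256 a, __m256 b, __m256 c, __m256 out[3])
{
    __m256d ad = _mm256_castps_pd(a);
    __m256d bd = _mm256_castps_pd(b);
    __m256d cd = _mm256_castps_pd(c);

    __m256d ab = _mm256_shuffle_pd(ad, bd, 0x0); /* [A0 B0 A2 B2] */
    __m256d bc = _mm256_shuffle_pd(bd, cd, 0xF); /* [B1 C1 B3 C3] */
    __m256d ca = _mm256_blend_pd(cd, ad, 0xA);   /* [C0 A1 C2 A3] */
    ca = _mm256_permute2f128_pd(ca, ca, 0x01);   /* [C2 A3 C0 A1], lanes swapped */

    out[0] = _mm256_castpd_ps(_mm256_blend_pd(ab, ca, 0xC)); /* [A0 B0 C0 A1] */
    out[1] = _mm256_castpd_ps(_mm256_blend_pd(bc, ab, 0xC)); /* [B1 C1 A2 B2] */
    out[2] = _mm256_castpd_ps(_mm256_blend_pd(ca, bc, 0xC)); /* [C2 A3 B3 C3] */
}

/* Returns 1 if both variants reproduce the layout from the question. */
SHUF3_TARGET
int shuffle3_selftest(void)
{
    float in[24], res[24];
    static const float want[24] = {
         0,  1,  8,  9, 16, 17,  2,  3,   /* a0 a1 b0 b1 c0 c1 a2 a3 */
        10, 11, 18, 19,  4,  5, 12, 13,   /* b2 b3 c2 c3 a4 a5 b4 b5 */
        20, 21,  6,  7, 14, 15, 22, 23 }; /* c4 c5 a6 a7 b6 b7 c6 c7 */

    for (int i = 0; i < 24; ++i)
        in[i] = (float)i; /* a = 0..7, b = 8..15, c = 16..23 */
    __m256 a = _mm256_loadu_ps(in);
    __m256 b = _mm256_loadu_ps(in + 8);
    __m256 c = _mm256_loadu_ps(in + 16);

    for (int variant = 1; variant <= 2; ++variant) {
        __m256 out[3];
        if (variant == 1) shuffle3_v1(a, b, c, out);
        else              shuffle3_v2(a, b, c, out);
        for (int k = 0; k < 3; ++k)
            _mm256_storeu_ps(res + 8 * k, out[k]);
        for (int i = 0; i < 24; ++i)
            if (res[i] != want[i]) return 0;
    }
    return 1;
}
```

Calling `shuffle3_selftest()` from `main` on an AVX2-capable CPU checks both variants against the layout in the question. The second variant uses only one lane-crossing shuffle and only AVX1 instructions (`vshufpd`, `vblendpd`, `vperm2f128`), whereas the first needs AVX2 for `vpermpd`; which one is faster in practice would have to be measured on the target CPU.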