
After a few operations I ended up with the following three intermediate vectors.

__m256 Vec1 = [a0 a1 a2 a3 a4 a5 a6 a7];    //8 float values
__m256 Vec2 = [b0 b1 b2 b3 b4 b5 b6 b7];    //8 float values
__m256 Vec3 = [c0 c1 c2 c3 c4 c5 c6 c7];    //8 float values

I need to rearrange these vectors as shown below for further processing.

__m256 ReVec1 = [a0 a1 b0 b1 c0 c1 a2 a3];
__m256 ReVec2 = [b2 b3 c2 c3 a4 a5 b4 b5];
__m256 ReVec3 = [c4 c5 a6 a7 b6 b7 c6 c7];

How can I shuffle these three vectors with AVX?

Which CPUs do you care about this being efficient on? Just Intel and/or Zen 2, or do you also care about Zen1 where lane-crossing shuffles are quite slow? (Not sure yet if there are multiple good options to pick from that would give different tradeoffs.) - Peter Cordes
It looks like elements always stay in pairs, so you can probably use vshufpd as a building block: it is a 2-input in-lane shuffle of 64-bit chunks that allows a different immediate control for the high and low lanes (but with restrictions on what can come from where). Or possibly even vpalignr to shift bytes in within each lane? Immediate blends are efficient; if you can blend the new elements into the vectors, then you can put things in the right order with vpermpd (lane-crossing). - Peter Cordes
Ok, so the final position of b's elements could be achieved by an in-lane shuffle to swap 64-bit halves of each 128-bit lane, ready to blend into each of the 3 outputs. a and c's elements could get where they're needed with one vpermpd lane-crossing shuffle each. Doing better than 3 shuffles + 6 blends would require finding ways to use 2-input shuffles, I think. - Peter Cordes
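A minimal sketch of the "3 shuffles + 6 blends" idea from the comment above, written with intrinsics. The function name `rearrange_3shuffles`, the pointer-based signature, and the packed 24-float output are assumptions made for illustration, not part of the question. Each vector is treated as four 64-bit pairs (A0 = a0a1, A1 = a2a3, and so on). The `target` attribute is GCC/Clang-specific and can be dropped when compiling with `-mavx2`.

```c
#include <immintrin.h>

/* Hypothetical helper: a, b, c hold 8 floats each; out receives
   ReVec1, ReVec2, ReVec3 back to back (24 floats). */
__attribute__((target("avx2")))
void rearrange_3shuffles(const float *a, const float *b, const float *c,
                         float *out)
{
    /* Reinterpret as 4 x 64-bit so each float pair moves as one element. */
    __m256d va = _mm256_castps_pd(_mm256_loadu_ps(a));
    __m256d vb = _mm256_castps_pd(_mm256_loadu_ps(b));
    __m256d vc = _mm256_castps_pd(_mm256_loadu_ps(c));

    /* In-lane swap of the 64-bit halves (vpermilpd): [B1 B0 B3 B2] */
    __m256d bs = _mm256_permute_pd(vb, 0x5);
    /* Lane-crossing vpermpd (AVX2): [A0 A3 A2 A1] */
    __m256d ap = _mm256_permute4x64_pd(va, _MM_SHUFFLE(1, 2, 3, 0));
    /* Lane-crossing vpermpd (AVX2): [C2 C1 C0 C3] */
    __m256d cp = _mm256_permute4x64_pd(vc, _MM_SHUFFLE(3, 0, 1, 2));

    /* Every pair now sits in its final position; two blends per output. */
    __m256d r1 = _mm256_blend_pd(_mm256_blend_pd(ap, bs, 0x2), cp, 0x4); /* [A0 B0 C0 A1] */
    __m256d r2 = _mm256_blend_pd(_mm256_blend_pd(bs, cp, 0x2), ap, 0x4); /* [B1 C1 A2 B2] */
    __m256d r3 = _mm256_blend_pd(_mm256_blend_pd(cp, ap, 0x2), bs, 0x4); /* [C2 A3 B3 C3] */

    _mm256_storeu_ps(out,      _mm256_castpd_ps(r1));
    _mm256_storeu_ps(out + 8,  _mm256_castpd_ps(r2));
    _mm256_storeu_ps(out + 16, _mm256_castpd_ps(r3));
}
```

Note that `_mm256_permute4x64_pd` requires AVX2, not just AVX. The casts between `__m256` and `__m256d` generate no instructions.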
If you pretend that a0a1, a2a3 is one element each, this is like a 3x4 transpose. For 3x8 transpose, Intel published an article once, which probably is adaptable to 3x4: software.intel.com/content/www/us/en/develop/articles/…. You may be more efficient if either the input or the output is actually from/to memory (no need for lane-crossing shuffles) - chtz
I found a solution with 6 blends and 2 shuffles (only 1 cross-lane shuffle). The trick is to first blend together the a and c elements that need to cross lanes, so that a single lane swap yields [c4c5 a6a7 c0c1 a2a3]. It is also possible to replace 2 blends + 1 shuffle with two 2-input shuffles (producing [a0a1 b0b1 a4a5 b4b5] and [b2b3 c2c3 b6b7 c6c7]). The rest is just 3 blends. - chtz
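One possible reading of the comment above as code, as a sketch rather than the commenter's actual implementation. The function names `rearrange_6blend` and `rearrange_unpack`, the pointer-based signatures, and the choice of `vperm2f128` for the lane swap are assumptions. Pairs are again treated as 64-bit elements (A0 = a0a1, and so on). The `target` attribute is GCC/Clang-specific and can be dropped when compiling with `-mavx`.

```c
#include <immintrin.h>

/* 6 blends + 2 shuffles (one of them lane-crossing).
   a, b, c: 8 floats each; out: ReVec1, ReVec2, ReVec3 (24 floats). */
__attribute__((target("avx")))
void rearrange_6blend(const float *a, const float *b, const float *c,
                      float *out)
{
    __m256d va = _mm256_castps_pd(_mm256_loadu_ps(a));
    __m256d vb = _mm256_castps_pd(_mm256_loadu_ps(b));
    __m256d vc = _mm256_castps_pd(_mm256_loadu_ps(c));

    /* Gather the a/c pairs that must change lanes: [C0 A1 C2 A3] */
    __m256d ac = _mm256_blend_pd(va, vc, 0x5);
    /* Swap the 128-bit lanes: [C2 A3 C0 A1] */
    __m256d x  = _mm256_permute2f128_pd(ac, ac, 0x01);
    /* In-lane swap of b's pairs: [B1 B0 B3 B2] */
    __m256d bs = _mm256_permute_pd(vb, 0x5);

    __m256d t = _mm256_blend_pd(va, bs, 0xA);   /* [A0 B0 A2 B2] */
    __m256d u = _mm256_blend_pd(bs, vc, 0xA);   /* [B1 C1 B3 C3] */

    /* Final three blends pick whole 128-bit lanes. */
    _mm256_storeu_ps(out,      _mm256_castpd_ps(_mm256_blend_pd(t, x, 0xC))); /* [A0 B0 C0 A1] */
    _mm256_storeu_ps(out + 8,  _mm256_castpd_ps(_mm256_blend_pd(u, t, 0xC))); /* [B1 C1 A2 B2] */
    _mm256_storeu_ps(out + 16, _mm256_castpd_ps(_mm256_blend_pd(x, u, 0xC))); /* [C2 A3 B3 C3] */
}

/* Variant: build t and u directly with two 2-input in-lane shuffles,
   replacing the in-lane swap of b and two of the blends. */
__attribute__((target("avx")))
void rearrange_unpack(const float *a, const float *b, const float *c,
                      float *out)
{
    __m256d va = _mm256_castps_pd(_mm256_loadu_ps(a));
    __m256d vb = _mm256_castps_pd(_mm256_loadu_ps(b));
    __m256d vc = _mm256_castps_pd(_mm256_loadu_ps(c));

    __m256d ac = _mm256_blend_pd(va, vc, 0x5);           /* [C0 A1 C2 A3] */
    __m256d x  = _mm256_permute2f128_pd(ac, ac, 0x01);   /* [C2 A3 C0 A1] */
    __m256d t  = _mm256_unpacklo_pd(va, vb);             /* [A0 B0 A2 B2] */
    __m256d u  = _mm256_unpackhi_pd(vb, vc);             /* [B1 C1 B3 C3] */

    _mm256_storeu_ps(out,      _mm256_castpd_ps(_mm256_blend_pd(t, x, 0xC)));
    _mm256_storeu_ps(out + 8,  _mm256_castpd_ps(_mm256_blend_pd(u, t, 0xC)));
    _mm256_storeu_ps(out + 16, _mm256_castpd_ps(_mm256_blend_pd(x, u, 0xC)));
}
```

Both versions use only AVX1 instructions. Which one is faster depends on how shuffle and blend throughput compare on the target CPU, so it is worth benchmarking both.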