11
votes

avx introduced the instruction vperm2f128 (exposed via _mm256_permute2f128_si256), while avx2 introduced vperm2i128 (exposed via _mm256_permute2x128_si256).

They both seem to be doing exactly the same, and their respective latencies and throughputs also seem to be identical.

So why do both instructions exist? There has to be some reasoning behind that? Is there maybe something I have overlooked? Given that avx2 operates on data structures introduced with avx, I cannot imagine that a processor will ever exist that supports avx2 but not avx.

1
Same reason as why there are load/store and logical instructions for both integers and floating-point in SSE/AVX: some microarchitectures partition the vector-units between floating-point FPU and integer ALU (probably to improve locality on the chip, keeping wires short), so that moving data between floating-point and integer is slow.EOF
I don't completely follow your reasoning. The signature for the f version is __m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8), and Intel's description reads Shuffle 128-bits (composed of integer data) (it's exactly the same as for permute2x128). Given that it takes __m256i as arguments, shouldn't they be in the integer ALU? Or are you saying a __m256i can be loaded in the FPU?mSSM

1 Answers

8
votes

There's a bit of a disconnect between the intrinsics and the actual instructions that are underneath.

AVX:

All 3 of these generate exactly the same instruction, vperm2f128:

  • _mm256_permute2f128_pd()
  • _mm256_permute2f128_ps()
  • _mm256_permute2f128_si256()

The only difference are the types - which don't exist at the instruction level.

vperm2f128 is a 256-bit floating-point instruction. In AVX, there are no "real" 256-bit integer SIMD instructions. So even though _mm256_permute2f128_si256() is an "integer" intrinsic, it's really just syntax sugar for this:

_mm256_castpd_si256(
    _mm256_permute2f128_pd(
        _mm256_castsi256_pd(x),
        _mm256_castsi256_pd(y),
        imm
    )
);

Which does a round trip from the integer domain to the FP domain - thus incurring bypass delays. As ugly as this looks, it is only way to do it in AVX-only land.

vperm2f128 isn't the only instruction to get this treatment, I find at least 3 of them:

  • vperm2f128 / _mm256_permute2f128_si256()
  • vextractf128 / _mm256_extractf128_si256()
  • vinsertf128 / _mm256_insertf128_si256()

Together, it seems that the usecase of these intrinsics is to load data as 256-bit integer vectors, and shuffle them into multiple 128-bit integer vectors for integer computation. Likewise the reverse where you store as 256-bit vectors.

Without these "hack" intrinsics, you would need to use a lot of cast intrinsics.

Either way, a competent compiler will try to optimize the types as well. Thus it will generate floating-point load/stores and shuffles even if you are using 256-bit integer loads. This reduces the number of bypass delays to only one layer. (when you go from FP-shuffle to 128-bit integer computation)


AVX2:

AVX2 cleans up this madness by adding proper 256-bit integer SIMD support for everything - including the shuffles.

The vperm2i128 instruction is new along with a new intrinsic for it, _mm256_permute2x128_si256().

This, along with _mm256_extracti128_si256() and _mm256_inserti128_si256() lets you do 256-bit integer SIMD and actually stay completely in the integer domain.


The distinction between integer FP versions of the same instructions has to do with bypass delays. In older processors, there were delays to move data from int <-> FP domains. While the SIMD registers themselves are type-agnostic, the hardware implementation isn't. And there is extra latency to get data output by an FP instruction to an input to an integer instruction. (and vice versa)

Thus it was important (from a performance standpoint) to use the correct instruction type to match the actual datatype that was being operated on.

On the newest processors (Skylake and later?), there doesn't seem to be anymore int/FP bypass delays with regards to the shuffle instructions. While the instruction set still has this distinction, shuffle instructions that do the same thing with different "types" probably map to the same uop now.