How does one efficiently perform horizontal addition with floats in a 512-bit AVX register (ie add the items from a single vector together)? For 128 and 256 bit registers this can be done using _mm_hadd_ps and _mm256_hadd_ps but there is no _mm512_hadd_ps. The Intel intrinsics guide documents _mm512_reduce_add_ps. It doesn't actually correspond to a single instruction but its existence suggests there is an optimal method, but it doesn't appear to be defined in the header files that come with the latest snapshot of GCC and I can't find a definition for it with Google.
I figure "hadd" can be emulated with _mm512_shuffle_ps and _mm512_add_ps or I could use _mm512_extractf32x4_ps to break a 512-bit register into four 128-bit registers but I want to make sure I'm not missing something better.
_mm512_reduce_add_ps
, exists for that purpose and compiles to a binary reduction of shuffles and sums.) – MysticialAMD Bulldozer/Piledriver/Steamroller
– Marat Dukhan