6
votes

How does one efficiently perform horizontal addition with floats in a 512-bit AVX register (i.e. add the items from a single vector together)? For 128-bit and 256-bit registers this can be done using _mm_hadd_ps and _mm256_hadd_ps, but there is no _mm512_hadd_ps. The Intel intrinsics guide documents _mm512_reduce_add_ps. It doesn't correspond to a single instruction, but its existence suggests there is an optimal method. However, it doesn't appear to be defined in the header files that come with the latest snapshot of GCC, and I can't find a definition for it with Google.

I figure "hadd" can be emulated with _mm512_shuffle_ps and _mm512_add_ps, or I could use _mm512_extractf32x4_ps to break a 512-bit register into four 128-bit registers, but I want to make sure I'm not missing something better.
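For illustration, the second idea (splitting the register into four 128-bit parts) might look roughly like the sketch below. The helper names hsum_ps_extract4 and hsum16_ps are made up for this example. The GCC/Clang target attribute and __builtin_cpu_supports are assumptions chosen so the code builds without -mavx512f, and the scalar loop is only a fallback for CPUs without AVX-512. This is not claimed to be the optimal sequence.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: sum 16 floats by splitting a zmm register
   into four xmm-sized parts and adding them together. */
__attribute__((target("avx512f")))
static float hsum_ps_extract4(const float *p) {
    __m512 v  = _mm512_loadu_ps(p);
    __m128 q0 = _mm512_castps512_ps128(v);       /* elements 0..3   */
    __m128 q1 = _mm512_extractf32x4_ps(v, 1);    /* elements 4..7   */
    __m128 q2 = _mm512_extractf32x4_ps(v, 2);    /* elements 8..11  */
    __m128 q3 = _mm512_extractf32x4_ps(v, 3);    /* elements 12..15 */
    __m128 s  = _mm_add_ps(_mm_add_ps(q0, q1), _mm_add_ps(q2, q3));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));      /* s0+s2, s1+s3 in elements 0, 1 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  /* element 0 += element 1 */
    return _mm_cvtss_f32(s);
}
#endif

/* Hypothetical wrapper: takes the AVX-512 path only when the CPU reports
   support, otherwise falls back to a plain scalar loop. */
static float hsum16_ps(const float *p) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx512f"))
        return hsum_ps_extract4(p);
#endif
    float s = 0.0f;
    for (size_t i = 0; i < 16; ++i)
        s += p[i];
    return s;
}
```

Note that the vector path adds the elements in a different order than the scalar loop, so for general floating-point inputs the two results can differ in the last bits.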

3
What exactly are you trying to do with a horizontal operation? If it's the end of a large reduction operation, then it probably isn't even performance-critical. (Nevertheless, _mm512_reduce_add_ps exists for that purpose and compiles to a binary reduction of shuffles and sums.) - Mysticial
I'm not surprised this is missing, as AVX-512 is viewed a bit as a departure from the standard "double the width" improvement. Operations are already cut up into 128-bit or 256-bit uops, so horizontal instructions wouldn't make much sense yet. - Cory Nelson
@CoryNelson To make it worse, horizontal instructions are microcoded on existing processors, so they're already slow. Horizontally vectorized tasks also violate the SIMD paradigm and don't scale. - Mysticial
@Mysticial Horizontal operations are microcoded only on AMD Bulldozer/Piledriver/Steamroller. - Marat Dukhan
@MaratDukhan According to Agner Fog's tables, they are also microcoded on Prescott, Core 2, Nehalem, Sandy Bridge, Haswell, Atom, and Via Nano, which pretty much covers everything else. He doesn't have any information on K10, and the entry is blank for K8. - Mysticial

3 Answers

5
votes

The Intel compiler defines the following intrinsics for horizontal sums:

_mm512_reduce_add_ps     //horizontal sum of 16 floats
_mm512_reduce_add_pd     //horizontal sum of 8 doubles
_mm512_reduce_add_epi32  //horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64  //horizontal sum of 8 64-bit integers

However, as far as I can tell these are broken into multiple instructions anyway so I don't think you gain anything more than doing the horizontal sum of the upper and lower part of the AVX512 register.

// float: zmm is a __m512
__m256 low  = _mm512_castps512_ps256(zmm);
__m256 high = _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(zmm),1));

// double: zmm is a __m512d
__m256d low  = _mm512_castpd512_pd256(zmm);
__m256d high = _mm512_extractf64x4_pd(zmm,1);

// integer: zmm is a __m512i
__m256i low  = _mm512_castsi512_si256(zmm);
__m256i high = _mm512_extracti64x4_epi64(zmm,1);

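To get the horizontal sum, add the two halves and reduce the result: sum = horizontal_add(low + high). The + operator on vector types works in GCC and Clang; elsewhere use _mm256_add_ps or _mm256_add_pd. For the integer case, use _mm256_add_epi32 or _mm256_add_epi64 so that the element width is explicit.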

static inline float horizontal_add (__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a,a);
    __m256 t2 = _mm256_hadd_ps(t1,t1);
    __m128 t3 = _mm256_extractf128_ps(t2,1);
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);
    return _mm_cvtss_f32(t4);        
}

static inline double horizontal_add (__m256d a) {
    __m256d t1 = _mm256_hadd_pd(a,a);
    __m128d t2 = _mm256_extractf128_pd(t1,1);
    __m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
    return _mm_cvtsd_f64(t3);        
}

I got all this information and these functions from Agner Fog's Vector Class Library and the online Intel Intrinsics Guide.
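The answer shows how to split a __m512i into halves but gives no integer counterpart to horizontal_add. A minimal sketch for 16 32-bit integers, following the same halving pattern, could look like this. The names hsum_epi32_avx512 and hsum16_epi32 are hypothetical and do not come from the Vector Class Library. The target attribute and __builtin_cpu_supports are GCC/Clang-specific assumptions, and the scalar loop is just a fallback for CPUs without AVX-512.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: fold 512 -> 256 -> 128 bits, then finish with
   two shuffle+add steps inside the 128-bit register. */
__attribute__((target("avx512f")))
static int32_t hsum_epi32_avx512(const int32_t *p) {
    __m512i v    = _mm512_loadu_si512((const void *)p);
    __m256i low  = _mm512_castsi512_si256(v);
    __m256i high = _mm512_extracti64x4_epi64(v, 1);
    __m256i s8   = _mm256_add_epi32(low, high);             /* 8 partial sums */
    __m128i s4   = _mm_add_epi32(_mm256_castsi256_si128(s8),
                                 _mm256_extracti128_si256(s8, 1));
    /* Swap the 64-bit halves and add, then swap adjacent elements and add. */
    __m128i s2   = _mm_add_epi32(s4, _mm_shuffle_epi32(s4, _MM_SHUFFLE(1, 0, 3, 2)));
    __m128i s1   = _mm_add_epi32(s2, _mm_shuffle_epi32(s2, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s1);
}
#endif

/* Hypothetical wrapper: AVX-512 path when available, scalar loop otherwise. */
static int32_t hsum16_epi32(const int32_t *p) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx512f"))
        return hsum_epi32_avx512(p);
#endif
    int32_t s = 0;
    for (int i = 0; i < 16; ++i)
        s += p[i];
    return s;
}
```

Unlike the float versions, the integer result does not depend on the order of the additions, so both paths return the same value as long as the sum does not overflow.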

0
votes

I'll give Z boson the check, as the post does answer my question, but I think the exact sequence of instructions can be improved upon:

inline float horizontal_add(__m512 a) {
    __m512 tmp = _mm512_add_ps(a,_mm512_shuffle_f32x4(a,a,_MM_SHUFFLE(0,0,3,2)));
    __m128 r = _mm512_castps512_ps128(_mm512_add_ps(tmp,_mm512_shuffle_f32x4(tmp,tmp,_MM_SHUFFLE(0,0,0,1))));
    r = _mm_hadd_ps(r,r);
    return _mm_cvtss_f32(_mm_hadd_ps(r,r));
}
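To make the lane selection in the two _mm512_shuffle_f32x4 calls easier to follow, here is a plain-C model of the same steps. It uses no intrinsics, and the names model_shuffle_f32x4, MODEL_SHUFFLE and model_horizontal_add are invented for this sketch. It assumes the documented behaviour of the instruction: each 2-bit field of the immediate picks one 128-bit lane, with the two low result lanes taken from the first operand and the two high result lanes from the second.

```c
#include <string.h>

/* Same bit layout as _MM_SHUFFLE, redefined here so the model needs no
   intrinsics header. */
#define MODEL_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* Model of _mm512_shuffle_f32x4(a, b, imm): result lanes 0-1 are selected
   from a, result lanes 2-3 from b; one lane is 4 floats. */
static void model_shuffle_f32x4(float dst[16], const float a[16],
                                const float b[16], int imm) {
    float tmp[16];
    for (int lane = 0; lane < 4; ++lane) {
        const float *src = (lane < 2) ? a : b;
        int sel = (imm >> (2 * lane)) & 3;
        memcpy(&tmp[4 * lane], &src[4 * sel], 4 * sizeof(float));
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Model of the horizontal_add sequence above. */
static float model_horizontal_add(const float a[16]) {
    float sh[16], tmp[16], r[4];

    /* Step 1: lane 0 becomes lane0+lane2, lane 1 becomes lane1+lane3. */
    model_shuffle_f32x4(sh, a, a, MODEL_SHUFFLE(0, 0, 3, 2));
    for (int i = 0; i < 16; ++i)
        tmp[i] = a[i] + sh[i];

    /* Step 2: lane 0 now holds the element-wise sum of all four lanes. */
    model_shuffle_f32x4(sh, tmp, tmp, MODEL_SHUFFLE(0, 0, 0, 1));
    for (int i = 0; i < 4; ++i)
        r[i] = tmp[i] + sh[i];

    /* Step 3: the two _mm_hadd_ps calls leave (r0+r1)+(r2+r3) in element 0. */
    return (r[0] + r[1]) + (r[2] + r[3]);
}
```

Only lanes 0 and 1 of the first shuffle and lane 0 of the second matter for the final result; the remaining lanes are computed but then discarded by the cast to __m128.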

0
votes

Horizontal sum for double precision:

static inline double _mm512_horizontal_add(__m512d a){
    __m256d b = _mm256_add_pd(_mm512_castpd512_pd256(a), _mm512_extractf64x4_pd(a,1));
    __m128d d = _mm_add_pd(_mm256_castpd256_pd128(b), _mm256_extractf128_pd(b,1));
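    __m128d hi = _mm_unpackhi_pd(d, d);   // bring the upper double down to element 0
    return _mm_cvtsd_f64(_mm_add_sd(d, hi));   // avoids reading d through a double* cast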
}

Edit: applied Peter Cordes's comments.