8
votes

I am looking for a fast way to calculate the dot product of vectors with 3 or 4 components. I tried several things, but most examples online use an array of floats while our data structure is different.

We use structs which are 16 byte aligned. Code excerpt (simplified):

struct float3 {
    float x, y, z, w; // 4th component unused here
};

struct float4 {
    float x, y, z, w;
};

In previous tests (using the SSE4 dot product intrinsic or FMA) I could not get a speedup compared to the following regular C++ code:

float dot(const float3 a, const float3 b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

Tests were done with gcc and clang on Intel Ivy Bridge / Haswell. It seems that the time spent loading the data into the SIMD registers and pulling it out again kills all the benefits.

I would appreciate some help and ideas on how the dot product can be efficiently calculated using our float3/4 data structures. SSE4, AVX or even AVX2 is fine.

Editor's note: for the 4-element case, see How to Calculate single-vector Dot Product using SSE intrinsic functions in C. That approach, with masking, may be good for the 3-element case, too.

2
Have you checked the generated ASM? For gcc, you can turn on generation of ASM output using the -S switch (the output is written to the target given with -o). What are your compilation options? Is it possible that gcc produces SSE code already? – Jonas Schäfer
As a rule of thumb, SSE speeds things up only if you have a lot of calculations without leaving the SSE registers. What you have in your dot function looks like not enough (and your tests confirm this). If you have something larger which includes a call to dot() (ideally a loop which calls dot() a thousand times, where the whole loop can be implemented in SSE), then you have a good chance for overall speedup. – No-Bugs Hare
It would be helpful to see more context, particularly the code that calls dot. Are you calling dot in a loop, e.g. for an array of float3 or float4? – Paul R
As you've noted, you need to amortize the overhead of loading data into SSE registers. Consider a different data layout such as structure-of-arrays (SoA). SSE isn't going to help for a one-off dot product. Calculating a block of 4 dot products 'vertically' (rather than with a 'horizontal' DPPS) takes advantage of SIMD, since one block can start while an earlier block is already in flight. – Brett Hale

2 Answers

5
votes

Algebraically, efficient SIMD looks almost identical to scalar code. So the right way to do the dot product is to operate on four float vectors at once with SSE (eight with AVX).

Consider constructing your code like this:

#include <x86intrin.h>

struct float4 {
    __m128 xmm;
    float4 () {};
    float4 (__m128 const & x) { xmm = x; }
    float4 & operator = (__m128 const & x) { xmm = x; return *this; }
    float4 & load(float const * p) { xmm = _mm_loadu_ps(p); return *this; }
    operator __m128() const { return xmm; }
};

static inline float4 operator + (float4 const & a, float4 const & b) {
    return _mm_add_ps(a, b);
}
static inline float4 operator * (float4 const & a, float4 const & b) {
    return _mm_mul_ps(a, b);
}

struct block3 {
    float4 x, y, z;
};

struct block4 {
    float4 x, y, z, w;
};

static inline float4 dot(block3 const & a, block3 const & b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

static inline float4 dot(block4 const & a, block4 const & b) {
    return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
}

Notice that the last two functions look almost identical to your scalar dot function except that float becomes float4 and float4 becomes block3 or block4. This will do the dot product most efficiently.

1
votes

To get the best out of AVX intrinsics, you have to think in a different dimension. Instead of doing one dot product, do 8 dot products in a single go.

Look up the difference between SoA and AoS. If your vectors are in SoA (structure-of-arrays) format, your data looks like this in memory:

// eight 3d vectors, called a.
float ax[8];
float ay[8];
float az[8];

// eight 3d vectors, called b.
float bx[8];
float by[8];
float bz[8];

Then, to multiply all 8 a vectors by their corresponding b vectors, you use three SIMD multiplications, one each for x, y, and z.

For a dot product you still need to add afterwards, of course, which is a little trickier. But multiplication, subtraction, and addition of vectors using SoA is pretty easy, and really fast. When AVX-512 is available, you can do 16 3D vector multiplications in just 3 instructions.