SSE operator+= for vectors

Question

I have two arrays of type double and I want to perform vecA += vecB. So far I am doing vecA = vecA + vecB, and as far as I know, for integers, for example, writing i = i + 5 is slower than i += 5. So I am wondering whether there is an SSE function that performs just an operator+= on __m128d. I searched and found nothing. My application spends approximately 60% of its time on this vecA = vecA + vecB operation, so any performance gain will show.

All arrays in the code snippets below are 16-byte aligned and len is always even.

The original code is simply

inline void addToDoubleVectorSSE(
         const double * what, const double * toWhat, double * dest, const unsigned int len)
{
   __m128d * _what      = (__m128d*)what;
   __m128d * _toWhat    = (__m128d*)toWhat;

   for ( register unsigned int i = 0; i < len; i+= 2 )
   {
       *_toWhat = _mm_add_pd( *_what, *_toWhat );
       _what++;
       _toWhat++;
   }
}

After reading http://fastcpp.blogspot.cz/2011/04/how-to-process-stl-vector-using-sse.html where the author gains performance by not writing immediately into what he just read from, I tried

__m128d * _what         = (__m128d*)what;
__m128d * _toWhat       = (__m128d*)toWhat;
__m128d * _toWhatBase   = (__m128d*)toWhat;

__m128d _dest1;
__m128d _dest2;

for ( register unsigned int i = 0; i < len; i+= 4 )
{
    _toWhatBase = _toWhat;
    _dest1      = _mm_add_pd( *_what++, *_toWhat++ );
    _dest2      = _mm_add_pd( *_what++, *_toWhat++ );

    *_toWhatBase++ = _dest1;
    *_toWhatBase++ = _dest2;
}

but there is no improvement in speed. So, is there any operator+= for __m128d? Or is there some other way to perform operator+= on arrays of doubles? The target platform is always going to be Windows (XP and 7) on Intel i7 CPUs, using MSVC.

I don't think this is related to the C language. If you need some speedup, a good idea would be to check the assembly code generated from your sources... - V-X

Sergey Kalinichenko · Accepted Answer · 2013-02-27T22:44:22

As far as I know, there is no equivalent of +=, because SSE arithmetic operations are generally register-to-register or memory-to-register, but not register-to-memory.
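To make that concrete, here is a minimal sketch of what an in-place add looks like when written with explicit loads and stores. The helper name addAssignPd is made up for illustration (it is not part of any SSE header), and both pointers are assumed to be 16-byte aligned, as in the question. Whether the source says a = a + b or a += b, the work is still two loads, one add and one store; the compiler may fold one of the loads into the add as a memory operand.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd

// Hypothetical helper, not an SSE intrinsic: dst[0..1] += src[0..1].
// Assumption: both pointers are 16-byte aligned.
inline void addAssignPd(double* dst, const double* src)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    const __m128d a = _mm_load_pd(dst);    // memory -> register
    const __m128d b = _mm_load_pd(src);    // memory -> register
    _mm_store_pd(dst, _mm_add_pd(a, b));   // add in registers, then register -> memory
}
```

So there is nothing for a dedicated += intrinsic to save here; the cost is in the memory traffic, not in how the addition is spelled.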

However, you can improve performance by using the advice from the blog post that you linked. The reason the trick failed to work for you is that you did not eliminate the dependency between the two instructions: the side effects of the ++ increments in _what++ and _toWhat++ prevent the second pair of operations from starting at the same time as the first. Modify your loop as follows to get an improvement:

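// Note: the body handles 4 doubles per iteration, so len must be a multiple of 4.
for ( register unsigned int i = 0; i < len; i+= 4, _what += 2, _toWhat += 2, _toWhatBase+=2 )
{
    _toWhatBase = _toWhat;
    // Two additions with no pointer side effects between them
    _dest1      = _mm_add_pd( *_what, *_toWhat );
    _dest2      = _mm_add_pd( *(_what+1), *(_toWhat+1));

    // Write both results back only after both sums are computed
    *_toWhatBase = _dest1;
    *(_toWhatBase+1) = _dest2;
}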

After the change, the operation on _dest2 becomes independent of the operation on _dest1.

According to my wall-clock estimates, I got about a 28% improvement after this simple modification.
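For completeness, here is a self-contained sketch of the same idea as a whole function. The name addInPlaceUnrolled and the index-based style are my own choices for illustration, not something from the question or the blog post. It assumes 16-byte aligned arrays and an even len, as stated in the question, and adds a tail step because a loop that advances by 4 doubles would otherwise read and write past the end when len is even but not a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical name; computes acc[i] += src[i] for i in [0, len).
// Assumptions: both arrays are 16-byte aligned and len is even.
inline void addInPlaceUnrolled(double* acc, const double* src, std::size_t len)
{
    assert(len % 2 == 0);
    assert(reinterpret_cast<std::uintptr_t>(acc) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    std::size_t i = 0;
    // Main loop: two independent load/add chains per iteration,
    // with both stores issued after both sums are computed.
    for (; i + 4 <= len; i += 4)
    {
        const __m128d sum0 = _mm_add_pd(_mm_load_pd(acc + i),     _mm_load_pd(src + i));
        const __m128d sum1 = _mm_add_pd(_mm_load_pd(acc + i + 2), _mm_load_pd(src + i + 2));
        _mm_store_pd(acc + i,     sum0);
        _mm_store_pd(acc + i + 2, sum1);
    }
    // Tail: at most one pair remains when len is even but not a multiple of 4.
    if (i < len)
    {
        _mm_store_pd(acc + i, _mm_add_pd(_mm_load_pd(acc + i), _mm_load_pd(src + i)));
    }
}
```

Any gain from this form depends on the compiler and CPU, so it is worth timing it against the original loop and checking the generated assembly, as suggested in the comment on the question.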
