2
votes

I have two arrays of type double and I want to perform vecA += vecB. So far I am doing vecA = vecA + vecB, and as far as I know, for integers, for example, writing i = i + 5 is slower than i += 5. So I am wondering whether there is an SSE function that performs just an operator+= on __m128d. I searched and found nothing. My application spends approximately 60% of its time in this vecA = vecA + vecB operation, so any performance gain will show.

All arrays in the code snippets below are 16-byte aligned and len is always even.

The original code is simply

inline void addToDoubleVectorSSE(
         const double * what, const double * toWhat, double * dest, const unsigned int len)
{
   __m128d * _what      = (__m128d*)what;
   __m128d * _toWhat    = (__m128d*)toWhat;

   for ( register unsigned int i = 0; i < len; i+= 2 )
   {
       *_toWhat = _mm_add_pd( *_what, *_toWhat );
       _what++;
       _toWhat++;
   }
}

After reading http://fastcpp.blogspot.cz/2011/04/how-to-process-stl-vector-using-sse.html where the author gains performance by not writing immediately into what he just read from, I tried

__m128d * _what         = (__m128d*)what;
__m128d * _toWhat       = (__m128d*)toWhat;
__m128d * _toWhatBase   = (__m128d*)toWhat;

__m128d _dest1;
__m128d _dest2;

for ( register unsigned int i = 0; i < len; i+= 4 )
{
    _toWhatBase = _toWhat;
    _dest1      = _mm_add_pd( *_what++, *_toWhat++ );
    _dest2      = _mm_add_pd( *_what++, *_toWhat++ );

    *_toWhatBase++ = _dest1;
    *_toWhatBase++ = _dest2;
}

but there is no improvement in speed. So, is there an operator+= for __m128d? Or is there some other way to perform operator+= on arrays of doubles? The target platform is always going to be Windows (XP and 7) on Intel i7 CPUs, using MSVC.

2
What makes you say i = i + 5 is slower than i += 5? - Carl Norum
I don't think it is related to the C language. If you need a speedup, a good idea would be to check the assembly code generated from your sources... - V-X

2 Answers

3
votes

As far as I know, there is no equivalent of +=, because SSE arithmetic operations are generally register-to-register or memory-to-register, but not register-to-memory.
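To make the point concrete, here is a minimal sketch of what an in-place += amounts to with SSE2 intrinsics: an explicit load of both operands, an add in registers, and a separate store. The function name addInPlaceSSE2 and the parameter names acc and src are hypothetical and not from the question; the sketch assumes, as the question does, that both arrays are 16-byte aligned and that len is even.

```cpp
#include <emmintrin.h>  // SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd
#include <cstddef>

// Hypothetical helper: computes acc[i] += src[i] for len doubles.
// Assumes acc and src are 16-byte aligned and len is even.
inline void addInPlaceSSE2(double* acc, const double* src, std::size_t len)
{
    for (std::size_t i = 0; i < len; i += 2)
    {
        __m128d a = _mm_load_pd(acc + i);         // aligned load of two doubles
        __m128d b = _mm_load_pd(src + i);         // aligned load of two doubles
        _mm_store_pd(acc + i, _mm_add_pd(a, b));  // add in registers, then store
    }
}
```

The store back to acc is always its own step here; there is no single intrinsic that adds a register into a memory location.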

However, you can improve your performance using the advice from the blog post that you linked. The reason the trick failed to work for you is that you did not eliminate the dependency between the two instructions: the side effects of the ++ increment in _what++ and _toWhat++ prevent the second pair of operations from starting at the same time. Modify your loop as follows to get an improvement:

for ( register unsigned int i = 0; i < len; i+= 4, _what += 2, _toWhat += 2, _toWhatBase+=2 )
{
    _toWhatBase = _toWhat;
    _dest1      = _mm_add_pd( *_what, *_toWhat );
    _dest2      = _mm_add_pd( *(_what+1), *(_toWhat+1));

    *_toWhatBase = _dest1;
    *(_toWhatBase+1) = _dest2;
}

After the change, the operation on _dest2 becomes independent of the operation on _dest1.

According to my wall-clock estimates, I got about a 28% improvement after this simple modification.
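For reference, the same idea can be written as a complete function. This is only a sketch under the question's assumptions (16-byte aligned arrays, even len); the name addInPlaceUnrolled and the parameters acc and src are hypothetical. It uses explicit loads and stores instead of pointer casts, and it handles the case where len is even but not a multiple of 4. Whether the unrolling pays off depends on the compiler and the CPU, so it is worth measuring on the actual target.

```cpp
#include <emmintrin.h>  // SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd
#include <cstddef>

// Hypothetical helper: acc[i] += src[i], two independent SSE adds per iteration.
// Assumes acc and src are 16-byte aligned and len is even.
inline void addInPlaceUnrolled(double* acc, const double* src, std::size_t len)
{
    std::size_t i = 0;
    for (; i + 4 <= len; i += 4)
    {
        // Both sums are computed from values loaded at fixed offsets from i,
        // so neither add depends on the result of the other.
        __m128d sum0 = _mm_add_pd(_mm_load_pd(acc + i),     _mm_load_pd(src + i));
        __m128d sum1 = _mm_add_pd(_mm_load_pd(acc + i + 2), _mm_load_pd(src + i + 2));

        _mm_store_pd(acc + i,     sum0);
        _mm_store_pd(acc + i + 2, sum1);
    }
    if (i < len)  // one pair left over when len is even but not a multiple of 4
    {
        _mm_store_pd(acc + i, _mm_add_pd(_mm_load_pd(acc + i), _mm_load_pd(src + i)));
    }
}
```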

4
votes

You are doing unnecessary work: modern compilers generate this kind of code automatically. The feature is called "auto-vectorization". MSVC supports it as well, in VS2012. I couldn't make much sense of your code, so I rewrote it like this:

inline void addToDoubleVectorSSE(
         const double * what, double * toWhat, const unsigned int len)
{
    for (unsigned ix = 0; ix < len; ++ix) 
        toWhat[ix] += what[ix];
}

This produced the following machine code:

00A3102E  xor         eax,eax  
00A31030  movupd      xmm0,xmmword ptr [esp+eax+358h]  
00A31039  movupd      xmm1,xmmword ptr [esp+eax+38h]  
00A3103F  add         eax,10h  
00A31042  addpd       xmm1,xmm0                          // <=== Look!!
00A31046  movupd      xmmword ptr [esp+eax+348h],xmm1  
00A3104F  cmp         eax,320h  
00A31054  jb          wmain+30h (0A31030h) 

Clearly you should favor this solution given how much cleaner the code looks. Update your VS version if necessary.
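Before switching from a hand-written version to the plain loop, it can help to confirm that both give the same results and then benchmark them separately. The following is a minimal sketch of such a check; the names addPlain, addIntrinsics and versionsAgree are hypothetical, and the intrinsic version uses unaligned loads and stores (as the movupd instructions in the listing above do) so that it works on std::vector storage. It assumes len is even.

```cpp
#include <emmintrin.h>  // SSE2: _mm_loadu_pd, _mm_add_pd, _mm_storeu_pd
#include <cstddef>
#include <vector>

// Plain loop, as in the answer; the compiler is free to auto-vectorize it.
inline void addPlain(double* acc, const double* src, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i)
        acc[i] += src[i];
}

// Hand-written SSE2 version with unaligned accesses. Assumes len is even.
inline void addIntrinsics(double* acc, const double* src, std::size_t len)
{
    for (std::size_t i = 0; i < len; i += 2)
        _mm_storeu_pd(acc + i, _mm_add_pd(_mm_loadu_pd(acc + i), _mm_loadu_pd(src + i)));
}

// Hypothetical check: runs both versions on the same data and compares results.
inline bool versionsAgree(std::size_t len)
{
    std::vector<double> src(len), a(len), b(len);
    for (std::size_t i = 0; i < len; ++i)
    {
        src[i] = 0.5 * static_cast<double>(i);
        a[i] = b[i] = 1.0 + static_cast<double>(i);
    }
    addPlain(a.data(), src.data(), len);
    addIntrinsics(b.data(), src.data(), len);
    return a == b;  // element-wise comparison of the two result vectors
}
```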