I have two arrays of type double and I want to perform vecA += vecB. So far, I am doing vecA = vecA + vecB, and as far as I know, for integers, for example, writing i = i + 5 is slower than i += 5. So I am wondering whether there is some SSE function that does just an operator+= on __m128d. I searched and found nothing. My application spends approximately 60% of its time on this vecA = vecA + vecB operation, so any performance gain will show.
All arrays in the code snippets below are 16-byte aligned, and len is always even.
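The two preconditions above (16-byte alignment and an even len) could be checked in a debug build with something like the following minimal sketch. The names isAligned16 and satisfiesSsePreconditions are hypothetical helpers made up for illustration; they are not part of my actual code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: checks the assumptions the SSE loops rely on,
// i.e. both arrays are 16-byte aligned and the element count is even.
inline bool satisfiesSsePreconditions(const double* a, const double* b, std::size_t len)
{
    return isAligned16(a) && isAligned16(b) && (len % 2 == 0);
}
```

A call such as assert(satisfiesSsePreconditions(what, toWhat, len)) at the top of the function would catch a misaligned pointer before the aligned SSE access faults.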
The original code is simply:

    inline void addToDoubleVectorSSE(
        const double * what, const double * toWhat, double * dest, const unsigned int len)
    {
        __m128d * _what = (__m128d*)what;
        __m128d * _toWhat = (__m128d*)toWhat;
        for ( register unsigned int i = 0; i < len; i += 2 )
        {
            *_toWhat = _mm_add_pd( *_what, *_toWhat );
            _what++;
            _toWhat++;
        }
    }
After reading http://fastcpp.blogspot.cz/2011/04/how-to-process-stl-vector-using-sse.html, where the author gains performance by not writing immediately into what he just read from, I tried:

    __m128d * _what = (__m128d*)what;
    __m128d * _toWhat = (__m128d*)toWhat;
    __m128d * _toWhatBase = (__m128d*)toWhat;
    __m128d _dest1;
    __m128d _dest2;
    for ( register unsigned int i = 0; i < len; i += 4 )
    {
        _toWhatBase = _toWhat;
        _dest1 = _mm_add_pd( *_what++, *_toWhat++ );
        _dest2 = _mm_add_pd( *_what++, *_toWhat++ );
        *_toWhatBase++ = _dest1;
        *_toWhatBase++ = _dest2;
    }
but there is no speed improvement. So, is there any operator+= for __m128d? Or is there some other way to perform operator+= on arrays of doubles? The target platform is always going to be Windows (XP and 7) on Intel i7 CPUs, using MSVC.
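To make the operation I am after concrete, here is a minimal sketch of the in-place form written with explicit loads and stores. The name addInPlaceSSE2 is hypothetical, and the sketch assumes the same preconditions as above (both pointers 16-byte aligned, len even). The SSE2 ADDPD instruction writes its result to an XMM register, so the store back to memory is a separate step here. I would not expect this to be faster than the pointer-dereferencing versions, since the compiler likely emits much the same load/add/store sequence for both.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: _mm_load_pd, _mm_add_pd, _mm_store_pd
#include <cstddef>

// Hypothetical helper: acc[i] += src[i] for i in [0, len).
// Assumes acc and src are 16-byte aligned and len is even.
inline void addInPlaceSSE2(double* acc, const double* src, std::size_t len)
{
    for (std::size_t i = 0; i < len; i += 2)
    {
        const __m128d a = _mm_load_pd(acc + i);   // aligned load of two doubles
        const __m128d b = _mm_load_pd(src + i);   // aligned load of two doubles
        _mm_store_pd(acc + i, _mm_add_pd(a, b));  // add, then aligned store back
    }
}
```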
i = i + 5 is slower than i += 5? – Carl Norum