
Ok, so I've been using operator overloading with some of the SSE/AVX intrinsics to facilitate their usage in more trivial situations where vector processing is useful. The class definition looks something like this:

#define Float16a float __attribute__((__aligned__(16)))

class sse
{
    private:

        __m128 vec  __attribute__((__aligned__(16)));

        Float16a *temp;

    public:

//=================================================================

        sse();
        sse(float *value);

//=================================================================

        void operator + (float *param);
        void operator - (float *param);
        void operator * (float *param);
        void operator / (float *param);
        void operator % (float *param);

        void operator ^ (int number);
        void operator = (float *param);

        void operator == (float *param);
        void operator += (float *param);
        void operator -= (float *param);
        void operator *= (float *param);
        void operator /= (float *param);
};

With each individual function bearing a resemblance to:

void sse::operator + (float *param)
{
    vec = _mm_add_ps(vec, _mm_load_ps(param));
    _mm_store_ps(temp, vec);
}

Thus far I have had few problems writing the code, but I have run into some performance problems: compared with fairly trivial scalar code, the SSE/AVX code takes a significant performance hit. I know that this type of code can be difficult to profile, but I'm not even sure what exactly the bottleneck is. If there are any pointers that can be thrown at me, it would be appreciated.

Note that this is just a personal project that I'm writing to further my own knowledge of SSE/AVX, so replacing it with an external library would not be much help.

Everything (almost everything) you need to know to make your own SSE/AVX SIMD class can be found here: vectorclass - Z boson
Since you tagged the question [gcc], you can directly write v+w; no need to create your own wrapper and call intrinsics. As a bonus, the same code will work for NEON, AltiVec, etc.: gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html - Marc Glisse
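
For reference, a minimal sketch of the vector-extensions approach mentioned in the comment above (the alias vec4f and the function add_mul are just names chosen for this example):

typedef float vec4f __attribute__((vector_size(16)));  // 4 packed floats

vec4f add_mul(vec4f v, vec4f w)
{
    // Ordinary operators compile directly to SIMD instructions
    // (addps/mulps on SSE, or their NEON/AltiVec equivalents).
    return (v + w) * w;
}

int main()
{
    vec4f v = {1.0f, 2.0f, 3.0f, 4.0f};
    vec4f w = {5.0f, 6.0f, 7.0f, 8.0f};
    vec4f r = add_mul(v, w);
    return (int)r[0];   // element access with [] also works in GCC
}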

3 Answers

0 votes

It would seem to me that the amount of overhead that you are introducing could easily overwhelm any speed you gain through the use of SSE operations.

Without looking at the assembly produced I can't say definitively what is happening, but here are two possible sources of overhead.

Calling a function (unless it is inlined) involves a call and a ret, and most likely a push and a pop, etc., to create a stack frame.

You're calling _mm_store_ps for each operation; if you chain more than one operation together, you're paying this cost more times than necessary.
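
For example, assuming the other operators follow the same pattern as the operator+ shown in the question (data, a, b and c standing in for 16-byte-aligned float arrays), a short sequence like this ends up storing after every single step even though only the final result is needed:

sse v(data);
v + a;    // load a, add, store to temp
v * b;    // load b, multiply, store to temp again
v - c;    // load c, subtract, store to temp a third time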

Also, it isn't clear from your code if this is a problem, but make sure that temp is a valid pointer.

Hope that helps somewhat. Good luck.


Follow up for comment.

Not sure if this is good C++ or not, please educate me if it isn't, but here's what I'd propose given my limited knowledge. I'd actually be very interested if other people have better suggestions.

Use what I believe is called a "conversion operator"; since the return value isn't a single float but rather 4 floats, you also need to add a type for it.

struct float_data
{
  float data[4] __attribute__((__aligned__(16)));  // _mm_store_ps requires 16-byte alignment
};

class sse
{
  ...
  float_data floatData;
  ...
  operator float_data&();
  ...
};

sse::operator float_data&()
{
  _mm_store_ps(floatData.data, vec);
  return floatData;
}
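
Usage would then look something like this (a hypothetical example; input and other stand in for 16-byte-aligned float arrays):

sse v(input);
v + other;               // some SIMD work, as in the question
float_data &result = v;  // the conversion operator stores vec and hands back the floats
float first = result.data[0];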
0 votes

This is part of my SSE library. When processing massive data sets I always use SoA (structure of arrays) instead of AoS (array of structures), and operator overloading for __m128/__m256 makes it easy to convert a C/C++ algorithm to SIMD.

Load/store is not wrapped by the library because SSE/AVX is very sensitive to memory operations: poor memory access costs dozens of CPU cycles and stalls your calculation.

__forceinline   __m128  operator+(__m128 l, __m128 r)   { return _mm_add_ps(l,r);       }
__forceinline   __m128  operator-(__m128 l, __m128 r)   { return _mm_sub_ps(l,r);       }
__forceinline   __m128  operator*(__m128 l, __m128 r)   { return _mm_mul_ps(l,r);       }
__forceinline   __m128  operator/(__m128 l, __m128 r)   { return _mm_div_ps(l,r);       }
__forceinline   __m128  operator&(__m128 l, __m128 r)   { return _mm_and_ps(l,r);       }
__forceinline   __m128  operator|(__m128 l, __m128 r)   { return _mm_or_ps(l,r);        }
__forceinline   __m128  operator<(__m128 l, __m128 r)   { return _mm_cmplt_ps(l,r);     }
__forceinline   __m128  operator>(__m128 l, __m128 r)   { return _mm_cmpgt_ps(l,r);     }
__forceinline   __m128  operator<=(__m128 l, __m128 r)  { return _mm_cmple_ps(l,r);     }
__forceinline   __m128  operator>=(__m128 l, __m128 r)  { return _mm_cmpge_ps(l,r);     }
__forceinline   __m128  operator!=(__m128 l, __m128 r)  { return _mm_cmpneq_ps(l,r);    }
__forceinline   __m128  operator==(__m128 l, __m128 r)  { return _mm_cmpeq_ps(l,r);     }

__forceinline   __m128  _mm_merge_ps(__m128 m, __m128 l, __m128 r)
{
    return _mm_or_ps(_mm_andnot_ps(m, l), _mm_and_ps(m, r));
}

struct TPoint4
{
    TPoint4() {}
    TPoint4(const D3DXVECTOR3& a) :x(_mm_set1_ps(a.x)), y(_mm_set1_ps(a.y)), z(_mm_set1_ps(a.z)) {}
    TPoint4(__m128 a, __m128 b, __m128 c) :x(a), y(b), z(c) {}
    TPoint4(const __m128* a) :x(a[0]), y(a[1]), z(a[2]) {}
    TPoint4(const D3DXVECTOR3& a, const D3DXVECTOR3& b, const D3DXVECTOR3& c, const D3DXVECTOR3& d) :x(_mm_set_ps(a.x,b.x,c.x,d.x)), y(_mm_set_ps(a.y,b.y,c.y,d.y)), z(_mm_set_ps(a.z,b.z,c.z,d.z)) {}

    operator __m128* ()             { return &x; }
    operator const __m128* () const { return &x; }

    TPoint4 operator+(const TPoint4& r) const   { return TPoint4(x+r.x, y+r.y, z+r.z);  }
    TPoint4 operator-(const TPoint4& r) const   { return TPoint4(x-r.x, y-r.y, z-r.z);  }
    TPoint4 operator*(__m128 r) const           { return TPoint4(x * r, y * r, z * r);  }
    TPoint4 operator/(__m128 r) const           { return TPoint4(x / r, y / r, z / r);  }

    __m128 operator[](int index) const          { return _val[index];                   }

    union
    {
        struct
        {
                __m128 x, y, z;
        };
        struct
        {
                __m128 _val[3];
        };
    };


};

__forceinline TPoint4* TPoint4Cross(TPoint4* result, const TPoint4* l, const TPoint4* r)
{
    result->x = (l->y * r->z) - (l->z * r->y);
    result->y = (l->z * r->x) - (l->x * r->z);
    result->z = (l->x * r->y) - (l->y * r->x);

    return result;
}

__forceinline __m128 TPoint4Dot(const TPoint4* l, const TPoint4* r)
{
    return (l->x * r->x) + (l->y * r->y) + (l->z * r->z);
}

__forceinline TPoint4* TPoint4Normalize(TPoint4* result, const TPoint4* l)
{
    __m128 rec_len = _mm_rsqrt_ps( (l->x * l->x) + (l->y * l->y) + (l->z * l->z) );

    result->x = l->x * rec_len;
    result->y = l->y * rec_len;
    result->z = l->z * rec_len;

    return result;
}

__forceinline __m128 TPoint4Length(const TPoint4* l)
{
    return _mm_sqrt_ps( (l->x * l->x) + (l->y * l->y) + (l->z * l->z) );
}

__forceinline TPoint4* TPoint4Merge(TPoint4* result, __m128 mask, const TPoint4* l, const TPoint4* r)
{
    result->x = _mm_merge_ps(mask, l->x, r->x);
    result->y = _mm_merge_ps(mask, l->y, r->y);
    result->z = _mm_merge_ps(mask, l->z, r->z);

    return result;
}

extern __m128   g_zero4;
extern __m128   g_one4;
extern __m128   g_fltMax4;
extern __m128   g_mask4;
extern __m128   g_epsilon4;
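
A usage sketch with made-up values (D3DXVECTOR3 is the DirectX float3 type already used by the constructors above). Since each TPoint4 holds four points in SoA layout, every call below processes four points at once:

D3DXVECTOR3 a0(1,0,0), a1(0,1,0), a2(0,0,1), a3(1,1,0);
D3DXVECTOR3 b0(0,1,0), b1(0,0,1), b2(1,0,0), b3(0,1,1);

TPoint4 a(a0, a1, a2, a3);   // x holds the four x components, etc.
TPoint4 b(b0, b1, b2, b3);

TPoint4 cross, n;
TPoint4Cross(&cross, &a, &b);      // four cross products in parallel
TPoint4Normalize(&n, &cross);      // four normalizations in parallel
__m128 dots = TPoint4Dot(&a, &b);  // four dot products in one register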
0 votes

If you are just learning SSE, I suggest using only raw intrinsics without any structs. In that case it is significantly easier for you to see what is going on and to tune performance. Coding with intrinsics is almost the same as coding directly in assembler, with the only difference being that the compiler does register allocation and manages memory loads/stores itself.

Speaking of your wrapper class, it has several problems:

  1. Remove temp pointer. It adds unnecessary data which is constantly moved around.
  2. Remove the default constructor. In most cases you don't want to waste time each time you declare a new variable. And do not implement a destructor, copy/move constructors, or assignments: they will only slow you down in the end.
  3. Define (i.e. write the function body of) all your operators in the header file. If you write the implementations of your operators in a cpp file, it may prevent the compiler from inlining them (unless you use link-time optimization, see this for example).
  4. Accept arguments of type sse by value wherever possible. If you pass float*, then you'll likely have to load the value from that pointer. However, in most cases it is not necessary: the data is already in a register. When you use values of type __m128, the compiler can decide itself whether it has to save/load data to memory.
  5. Return a value of type sse from each non-modifying operator. Right now you store the result through a memory pointer, which is implemented in an ugly way. This forces the compiler to really store the data to memory instead of simply keeping the value in a register. When you return __m128 by value, the compiler can decide when to save/load data.

Here is your code rewritten for better performance and usability:

class sse {
private:
    __m128 vec;
public:
    explicit sse(float *ptr) { vec = _mm_loadu_ps(ptr); }
    sse(__m128 reg) { vec = reg; }
    void store(float *ptr) { _mm_storeu_ps(ptr, vec); }

    sse operator + (sse other) const {
        return sse(_mm_add_ps(vec, other.vec));
    }
    sse operator - (sse other) const {...}
    sse operator * (sse other) const {...}
    sse operator / (sse other) const {...}

    void operator += (sse other) {
        vec = _mm_add_ps(vec, other.vec);
    }
    void operator -= (sse other) {...}
    void operator *= (sse other) {...}
    void operator /= (sse other) {...}

    //I don't know what you mean by these operators:
    //void operator ^ (int number);
    //void operator == (float *param);
    //sse operator % (sse other);
};
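
With this interface the intermediate values stay in registers; memory is only touched by the load in the constructor and the final store() call (a sketch assuming 4-element float arrays):

float in1[4] = {1, 2, 3, 4};
float in2[4] = {5, 6, 7, 8};
float out[4];

sse a(in1), b(in2);
sse result = (a + b) * b;   // no intermediate stores
result.store(out);          // one explicit store at the end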

P.S. In any case you should regularly inspect the assembly generated by your compiler in order to see if it has any performance issues.