Ok, so I've been using operator overloading with some of the SSE/AVX intrinsics to facilitate their usage in more trivial situations where vector processing is useful. The class definition looks something like this:
    #define Float16a float __attribute__((__aligned__(16)))
    class sse
    {
        private:
            __m128 vec __attribute__((__aligned__(16)));
            Float16a *temp;
        public:
            //=================================================================
            sse();
            sse(float *value);
            //=================================================================
            void operator + (float *param);
            void operator - (float *param);
            void operator * (float *param);
            void operator / (float *param);
            void operator % (float *param);
            void operator ^ (int number);
            void operator = (float *param);
            void operator == (float *param);
            void operator += (float *param);
            void operator -= (float *param);
            void operator *= (float *param);
            void operator /= (float *param);
    };
Each individual function looks something like this:
    void sse::operator + (float *param)
    {
        vec = _mm_add_ps(vec, _mm_load_ps(param));
        _mm_store_ps(temp, vec);
    }
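To make the pattern concrete, here is a minimal, self-contained sketch of the same idea that compiles on its own. It is not the actual class: the name `sse_min`, the `data()` accessor, the constructor behaviour, and the use of an in-object array for `temp` (instead of a pointer) are all assumptions made for illustration.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical reduction of the class above; only operator + is shown.
class sse_min
{
    private:
        __m128 vec;                 // running value
        alignas(16) float temp[4];  // assumption: a 16-byte aligned buffer owned by the object
    public:
        // Assumption: the constructor loads four floats from a 16-byte aligned pointer.
        explicit sse_min(const float *value) : vec(_mm_load_ps(value))
        {
            _mm_store_ps(temp, vec);
        }

        // Same shape as sse::operator + above: load, add, then store on every call.
        void operator + (const float *param)
        {
            vec = _mm_add_ps(vec, _mm_load_ps(param));
            _mm_store_ps(temp, vec);
        }

        // Hypothetical accessor so the stored result can be inspected.
        const float *data() const { return temp; }
};
```

With two 16-byte aligned arrays `a` and `b`, usage would be `sse_min v(a); v + b;`, after which `v.data()` points at the four element-wise sums. Note that `_mm_load_ps` and `_mm_store_ps` require 16-byte aligned addresses, so the inputs need `alignas(16)` or an equivalent attribute.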
Thus far I have had few problems writing the code, but I have run into some performance problems: compared with fairly trivial scalar code, the SSE/AVX code takes a significant performance hit. I know that this type of code can be difficult to profile, but I'm not really even sure what exactly the bottleneck is. If there are any pointers that can be thrown at me, it would be appreciated.
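For illustration, the kind of scalar-versus-SSE comparison described here could be set up roughly as follows. This is only a sketch under stated assumptions, not the actual benchmark: `add_scalar`, `add_sse`, and `time_ns` are hypothetical names, and the SSE function simply mirrors the load/add/store round trip of `sse::operator +`.

```cpp
#include <chrono>
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

// Scalar baseline: add four floats element by element.
inline void add_scalar(float *acc, const float *rhs)
{
    for (int i = 0; i < 4; ++i)
        acc[i] += rhs[i];
}

// SSE version with the same load/add/store round trip per call.
// Assumption: both pointers are 16-byte aligned.
inline void add_sse(float *acc, const float *rhs)
{
    __m128 v = _mm_add_ps(_mm_load_ps(acc), _mm_load_ps(rhs));
    _mm_store_ps(acc, v);
}

// Hypothetical helper: call fn(acc, rhs) `iterations` times and
// return the elapsed wall-clock time in nanoseconds.
template <typename Fn>
long long time_ns(Fn fn, float *acc, const float *rhs, std::size_t iterations)
{
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        fn(acc, rhs);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling `time_ns(add_scalar, a, r, n)` and `time_ns(add_sse, b, r, n)` on aligned buffers gives two numbers to compare. The results depend heavily on optimization flags: an optimizing compiler may auto-vectorize the scalar loop or keep its values in registers, so a harness this small is only a starting point for finding the bottleneck.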
Note that this is just a personal project that I'm writing to further my own knowledge of SSE/AVX, so replacing this with an external library would not be much help.