
Ok, so I've been using operator overloading with some of the SSE/AVX intrinsics to facilitate their usage in more trivial situations where vector processing is useful. The class definition looks something like this:

#define Float16a float __attribute__((__aligned__(16)))

class sse
{
    private:

        __m128 vec  __attribute__((__aligned__(16)));

        Float16a *temp;

    public:

//=================================================================

        sse();
        sse(float *value);

//=================================================================

        void operator + (float *param);
        void operator - (float *param);
        void operator * (float *param);
        void operator / (float *param);
        void operator % (float *param);

        void operator ^ (int number);
        void operator = (float *param);

        void operator == (float *param);
        void operator += (float *param);
        void operator -= (float *param);
        void operator *= (float *param);
        void operator /= (float *param);
};

With each individual function bearing a resemblance to:

void sse::operator + (float *param)
{
    vec = _mm_add_ps(vec, _mm_load_ps(param));
    _mm_store_ps(temp, vec);
}

Thus far I have had few problems writing the code, but I have run into some performance problems: compared with fairly trivial scalar code, the SSE/AVX code takes a significant performance hit. I know that this type of code can be difficult to profile, but I'm not even sure what exactly the bottleneck is. If there are any pointers that can be thrown at me, it would be appreciated.

Note that this is just a personal project that I'm writing to further my own knowledge of SSE/AVX, so replacing it with an external library would not be much help.

Everything (almost everything) you need to know to make your own SSE/AVX SIMD class can be found here: vectorclass - Z boson
Since you tagged the question [gcc], you can directly write v+w; no need to create your own wrapper and call intrinsics. As a bonus, the same code will work for NEON, AltiVec, etc.: gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html - Marc Glisse
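
For reference, a minimal sketch of the vector-extensions approach mentioned in the comment above (the alias vec4f and the function add_mul are just names chosen for this example):

typedef float vec4f __attribute__((vector_size(16)));  // 4 packed floats

vec4f add_mul(vec4f v, vec4f w)
{
    // Ordinary operators compile directly to SIMD instructions
    // (addps/mulps on SSE, or their NEON/AltiVec equivalents).
    return (v + w) * w;
}

int main()
{
    vec4f v = {1.0f, 2.0f, 3.0f, 4.0f};
    vec4f w = {5.0f, 6.0f, 7.0f, 8.0f};
    vec4f r = add_mul(v, w);
    return (int)r[0];   // element access with [] also works in GCC
}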

3 Answers

0 votes

It would seem to me that the amount of overhead that you are introducing could easily overwhelm any speed you gain through the use of SSE operations.

Without looking at the assembly produced I can't say definitively what is happening, but here are two possible sources of overhead.

Calling a function (unless it is inlined) involves a call and a ret, and most likely a push and a pop, etc., to create a stack frame.

You're calling _mm_store_ps for each operation; if you chain more than one operation together, you're paying this cost more times than necessary.
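
For example, assuming the other operators follow the same pattern as the operator+ shown in the question (data, a, b and c standing in for 16-byte-aligned float arrays), a short sequence like this ends up storing after every single step even though only the final result is needed:

sse v(data);
v + a;    // load a, add, store to temp
v * b;    // load b, multiply, store to temp again
v - c;    // load c, subtract, store to temp a third time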

Also, it isn't clear from your code if this is a problem, but make sure that temp is a valid pointer.

Hope that helps somewhat. Good luck.


Follow up for comment.

Not sure if this is good C++ or not, please educate me if it isn't, but here's what I'd propose given my limited knowledge. I'd actually be very interested if other people have better suggestions.

Use what I believe is called a "conversion operator"; since the return value isn't a single float but rather 4 floats, you also need to add a type for it.

struct float_data
{
  float data[4] __attribute__((__aligned__(16)));  // _mm_store_ps requires 16-byte alignment
};

class sse
{
  ...
  float_data floatData;
  ...
  operator float_data&();
  ...
};

sse::operator float_data&()
{
  _mm_store_ps(floatData.data, vec);
  return floatData;
}
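
Usage would then look something like this (a hypothetical example; input and other stand in for 16-byte-aligned float arrays):

sse v(input);
v + other;               // some SIMD work, as in the question
float_data &result = v;  // the conversion operator stores vec and hands back the floats
float first = result.data[0];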
0 votes

This is part of my SSE library. When processing massive data sets I always use SoA (structure of arrays) instead of AoS (array of structures), and operator overloading for __m128/__m256 makes it easy to convert a C/C++ algorithm to SIMD.

Load/store is not wrapped by the library because SSE/AVX is very sensitive to memory operations: poor memory access costs dozens of CPU cycles and stalls your calculation.

__forceinline   __m128  operator+(__m128 l, __m128 r)   { return _mm_add_ps(l,r);       }
__forceinline   __m128  operator-(__m128 l, __m128 r)   { return _mm_sub_ps(l,r);       }
__forceinline   __m128  operator*(__m128 l, __m128 r)   { return _mm_mul_ps(l,r);       }
__forceinline   __m128  operator/(__m128 l, __m128 r)   { return _mm_div_ps(l,r);       }
__forceinline   __m128  operator&(__m128 l, __m128 r)   { return _mm_and_ps(l,r);       }
__forceinline   __m128  operator|(__m128 l, __m128 r)   { return _mm_or_ps(l,r);        }
__forceinline   __m128  operator<(__m128 l, __m128 r)   { return _mm_cmplt_ps(l,r);     }
__forceinline   __m128  operator>(__m128 l, __m128 r)   { return _mm_cmpgt_ps(l,r);     }
__forceinline   __m128  operator<=(__m128 l, __m128 r)  { return _mm_cmple_ps(l,r);     }
__forceinline   __m128  operator>=(__m128 l, __m128 r)  { return _mm_cmpge_ps(l,r);     }
__forceinline   __m128  operator!=(__m128 l, __m128 r)  { return _mm_cmpneq_ps(l,r);    }
__forceinline   __m128  operator==(__m128 l, __m128 r)  { return _mm_cmpeq_ps(l,r);     }

__forceinline   __m128  _mm_merge_ps(__m128 m, __m128 l, __m128 r)
{
    return _mm_or_ps(_mm_andnot_ps(m, l), _mm_and_ps(m, r));
}

struct TPoint4
{
    TPoint4() {}
    TPoint4(const D3DXVECTOR3& a) :x(_mm_set1_ps(a.x)), y(_mm_set1_ps(a.y)), z(_mm_set1_ps(a.z)) {}
    TPoint4(__m128 a, __m128 b, __m128 c) :x(a), y(b), z(c) {}
    TPoint4(const __m128* a) :x(a[0]), y(a[1]), z(a[2]) {}
    TPoint4(const D3DXVECTOR3& a, const D3DXVECTOR3& b, const D3DXVECTOR3& c, const D3DXVECTOR3& d) :x(_mm_set_ps(a.x,b.x,c.x,d.x)), y(_mm_set_ps(a.y,b.y,c.y,d.y)), z(_mm_set_ps(a.z,b.z,c.z,d.z)) {}

    operator __m128* ()             { return &x; }
    operator const __m128* () const { return &x; }

    TPoint4 operator+(const TPoint4& r) const   { return TPoint4(x+r.x, y+r.y, z+r.z);  }
    TPoint4 operator-(const TPoint4& r) const   { return TPoint4(x-r.x, y-r.y, z-r.z);  }
    TPoint4 operator*(__m128 r) const           { return TPoint4(x * r, y * r, z * r);  }
    TPoint4 operator/(__m128 r) const           { return TPoint4(x / r, y / r, z / r);  }

    __m128 operator[](int index) const          { return _val[index];                   }

    union
    {
        struct
        {
                __m128 x, y, z;
        };
        struct
        {
                __m128 _val[3];
        };
    };


};

__forceinline TPoint4* TPoint4Cross(TPoint4* result, const TPoint4* l, const TPoint4* r)
{
    result->x = (l->y * r->z) - (l->z * r->y);
    result->y = (l->z * r->x) - (l->x * r->z);
    result->z = (l->x * r->y) - (l->y * r->x);

    return result;
}

__forceinline __m128 TPoint4Dot(const TPoint4* l, const TPoint4* r)
{
    return (l->x * r->x) + (l->y * r->y) + (l->z * r->z);
}

__forceinline TPoint4* TPoint4Normalize(TPoint4* result, const TPoint4* l)
{
    __m128 rec_len = _mm_rsqrt_ps( (l->x * l->x) + (l->y * l->y) + (l->z * l->z) );

    result->x = l->x * rec_len;
    result->y = l->y * rec_len;
    result->z = l->z * rec_len;

    return result;
}

__forceinline __m128 TPoint4Length(const TPoint4* l)
{
    return _mm_sqrt_ps( (l->x * l->x) + (l->y * l->y) + (l->z * l->z) );
}

__forceinline TPoint4* TPoint4Merge(TPoint4* result, __m128 mask, const TPoint4* l, const TPoint4* r)
{
    result->x = _mm_merge_ps(mask, l->x, r->x);
    result->y = _mm_merge_ps(mask, l->y, r->y);
    result->z = _mm_merge_ps(mask, l->z, r->z);

    return result;
}

extern __m128   g_zero4;
extern __m128   g_one4;
extern __m128   g_fltMax4;
extern __m128   g_mask4;
extern __m128   g_epsilon4;
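
A usage sketch with made-up values (D3DXVECTOR3 is the DirectX float3 type already used by the constructors above). Since each TPoint4 holds four points in SoA layout, every call below processes four points at once:

D3DXVECTOR3 a0(1,0,0), a1(0,1,0), a2(0,0,1), a3(1,1,0);
D3DXVECTOR3 b0(0,1,0), b1(0,0,1), b2(1,0,0), b3(0,1,1);

TPoint4 a(a0, a1, a2, a3);   // x holds the four x components, etc.
TPoint4 b(b0, b1, b2, b3);

TPoint4 cross, n;
TPoint4Cross(&cross, &a, &b);      // four cross products in parallel
TPoint4Normalize(&n, &cross);      // four normalizations in parallel
__m128 dots = TPoint4Dot(&a, &b);  // four dot products in one register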
0 votes

If you are just learning SSE, I suggest using only raw intrinsics without any structs. In that case it is significantly easier for you to see what is going on and to tune performance. Coding with intrinsics is almost the same as coding directly in assembler, with the only difference being that the compiler does register allocation and manages memory loads/stores itself.

Speaking of your wrapper class, it has several problems:

  1. Remove temp pointer. It adds unnecessary data which is constantly moved around.
  2. Remove the default constructor. In most cases you don't want to waste time each time you declare a new variable. And do not implement a destructor, copy/move constructors, or assignments: they will only slow you down in the end.
  3. Define (i.e. write the function body of) all your operators in the header file. If you write the implementations of your operators in a cpp file, it may prevent the compiler from inlining them (unless you use link-time optimization, see this for example).
  4. Accept arguments of type sse by value wherever possible. If you pass float*, then you'll likely have to load the value from that pointer. However, in most cases it is not necessary: the data is already in a register. When you use values of type __m128, the compiler can decide itself whether it has to save/load data to memory.
  5. Return a value of type sse from each non-modifying operator. Right now you store the result through a memory pointer, which is implemented in an ugly way. This forces the compiler to really store the data to memory instead of simply keeping the value in a register. When you return __m128 by value, the compiler can decide when to save/load data.

Here is your code rewritten for better performance and usability:

class sse {
private:
    __m128 vec;
public:
    explicit sse(float *ptr) { vec = _mm_loadu_ps(ptr); }
    sse(__m128 reg) { vec = reg; }
    void store(float *ptr) { _mm_storeu_ps(ptr, vec); }

    sse operator + (sse other) const {
        return sse(_mm_add_ps(vec, other.vec));
    }
    sse operator - (sse other) const {...}
    sse operator * (sse other) const {...}
    sse operator / (sse other) const {...}

    void operator += (sse other) {
        vec = _mm_add_ps(vec, other.vec);
    }
    void operator -= (sse other) {...}
    void operator *= (sse other) {...}
    void operator /= (sse other) {...}

    //I don't know what you mean by these operators:
    //void operator ^ (int number);
    //void operator == (float *param);
    //sse operator % (sse other);
};
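
With this interface the intermediate values stay in registers; memory is only touched by the load in the constructor and the final store() call (a sketch assuming 4-element float arrays):

float in1[4] = {1, 2, 3, 4};
float in2[4] = {5, 6, 7, 8};
float out[4];

sse a(in1), b(in2);
sse result = (a + b) * b;   // no intermediate stores
result.store(out);          // one explicit store at the end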

P.S. In any case you should regularly inspect the assembly generated by your compiler in order to see if it has any performance issues.