For personal fun, I'm coding a geometry lib using SSE (4.1).
I've spent the last 12 hours trying to understand a performance issue when dealing with row-major vs. column-major matrix storage.
I know DirectX/OpenGL matrices are stored row major, so it would be better for me to keep my matrices in row-major order too; then I'd need no conversion when storing/loading matrices to/from the GPU/shaders.
But after profiling, I get faster results with column major.
To transform a point by a transform matrix in row major, it's P' = P * M; in column major, it's P' = M * P. So in column major it's simply 4 dot products, i.e. only 4 SSE4.1 instructions (_mm_dp_ps), while in row major I must do those 4 dot products against the transposed matrix.
Performance results on 10M vectors:
(30/05/2014@08:48:10) Log : [5] ( Vec.Mul.Matrix ) = 76.216653 ms ( row major transform )
(30/05/2014@08:48:10) Log : [6] ( Matrix.Mul.Vec ) = 61.554892 ms ( column major transform )
I tried several ways to do the Vec * Matrix operation, with and without _MM_TRANSPOSE, and the fastest way I found is this:
mssFloat Vec4::operator|(const Vec4& v) const //-- dot product
{
    // _mm_dp_ps computes the 4-wide dot product into lane 0 (mask 0xFF);
    // _mm_cvtss_f32 extracts that lane as a scalar float.
    return _mm_cvtss_f32(_mm_dp_ps(m_val, v.m_val, 0xFF));
}

inline Vec4 operator*(const Vec4& vec, const Mat4& m)
{
    // Build each column of m as a Vec4, then dot it against vec.
    return Vec4( Vec4( m[0][0], m[1][0], m[2][0], m[3][0] ) | vec
               , Vec4( m[0][1], m[1][1], m[2][1], m[3][1] ) | vec
               , Vec4( m[0][2], m[1][2], m[2][2], m[3][2] ) | vec
               , Vec4( m[0][3], m[1][3], m[2][3], m[3][3] ) | vec
               );
}
My Vec4 class is simply a __m128 m_val; with optimizations on, the vector construction is all done efficiently in SSE registers.
My first guess is that this multiplication is not optimal. I'm new to SSE, so I'm a bit puzzled about how to optimize this. My intuition tells me to use shuffle instructions, but I'd like to understand why that would be faster. Would loading 4 shuffled __m128 be faster than assigning them ( __m128 m_val = _mm_set_ps(w, z, y, x); )?
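For reference, the usual shuffle-based alternative for the row-major case is to broadcast ("splat") each component of the vector and accumulate scaled rows, which avoids both the transpose and _mm_dp_ps entirely (a sketch under my assumptions about the layout: `row[i]` is the i-th row of the row-major matrix; `mul_vec_mat` is a hypothetical name):

```cpp
#include <xmmintrin.h> // plain SSE is enough here

// P' = P * M for a row-major M (row[i] = i-th row of M).
// Each component of v is broadcast to all four lanes with a shuffle,
// then multiplied by the matching row; summing the four products gives
// the transformed vector with no transpose and no dot-product instruction.
static inline __m128 mul_vec_mat(__m128 v, const __m128 row[4])
{
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
    return _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(x, row[0]), _mm_mul_ps(y, row[1])),
        _mm_add_ps(_mm_mul_ps(z, row[2]), _mm_mul_ps(w, row[3])));
}
```

The shuffles here stay in registers, whereas gathering column elements one scalar at a time forces the compiler to assemble each __m128 from pieces.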
I couldn't find performance info on _mm_set_ps at https://software.intel.com/sites/landingpage/IntrinsicsGuide/
EDIT : I double-checked the profiling method; each test is done in the same manner, so no memory-cache differences. To avoid local cache effects, I'm running the operation over a big randomized vector array, with the same seed for each test. Only 1 test per execution, to avoid performance gains from the memory cache.