3 votes

For personal fun, I'm coding a geometry lib using SSE (up to SSE4.1).

I spent the last 12 hours trying to understand a performance issue when dealing with row-major vs column-major matrix storage.

I know DirectX/OpenGL matrices are stored row major, so it would be better for me to keep my matrices stored in row-major order so that I have no conversion when storing/loading matrices to/from the GPU/shaders.

But after some profiling, I get faster results with column major.

To transform a point with a transform matrix in row major, it's P' = P * M, and in column major, it's P' = M * P. So in column major it's simply 4 dot products, i.e. only 4 SSE4.1 instructions ( _mm_dp_ps ), whereas in row major I must do those 4 dot products on the transposed matrix.
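In scalar form, this is the operation I mean (a plain reference sketch just to fix notation, not my actual code):

// P' = M * P (column-vector convention): each component of P' is the dot
// product of one row of M with P. If the matrix is stored so that those 4
// values are contiguous in memory, this maps directly onto 4 _mm_dp_ps calls.
void transform_ref(const float M[4][4], const float P[4], float Pout[4])
{
    for (int i = 0; i < 4; ++i)
        Pout[i] = M[i][0]*P[0] + M[i][1]*P[1] + M[i][2]*P[2] + M[i][3]*P[3];
}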

Performance result on 10M vectors

(30/05/2014@08:48:10) Log : [5] ( Vec.Mul.Matrix ) = 76.216653 ms ( row major transform )

(30/05/2014@08:48:10) Log : [6] ( Matrix.Mul.Vec ) = 61.554892 ms ( column major transform )

I tried several ways to do the Vec * Matrix operation, with and without _MM_TRANSPOSE (a sketch of the transpose variant follows the code below), and the fastest way I found is this:

mssFloat Vec4::operator|(const Vec4& v) const //-- Dot Product
{
    return _mm_dp_ps(m_val, v.m_val, 0xFF).m128_f32[0];
}

inline Vec4 operator*(const Vec4& vec, const Mat4& m)
{
    // Gather each column of m into a temporary Vec4 (i.e. work on the
    // transposed matrix), then dot it with vec.
    return Vec4( Vec4(m[0][0], m[1][0], m[2][0], m[3][0]) | vec
               , Vec4(m[0][1], m[1][1], m[2][1], m[3][1]) | vec
               , Vec4(m[0][2], m[1][2], m[2][2], m[3][2]) | vec
               , Vec4(m[0][3], m[1][3], m[2][3], m[3][3]) | vec );
}
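For comparison, the _MM_TRANSPOSE variant I also tried looks roughly like this (a sketch; it assumes direct access to the four __m128 rows of the matrix, which is not my exact Mat4 API):

#include <smmintrin.h>  // SSE4.1: _mm_dp_ps (xmmintrin.h provides _MM_TRANSPOSE4_PS)

static inline __m128 mulVecMatTranspose(__m128 v, __m128 r0, __m128 r1, __m128 r2, __m128 r3)
{
    // Transpose once so that each group of 4 values to dot with v sits in a single
    // register, then do 4 _mm_dp_ps; each mask routes its result to a different lane.
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    return _mm_or_ps(_mm_or_ps(_mm_dp_ps(r0, v, 0xF1), _mm_dp_ps(r1, v, 0xF2)),
                     _mm_or_ps(_mm_dp_ps(r2, v, 0xF4), _mm_dp_ps(r3, v, 0xF8)));
}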

My Vec4 class is simply a __m128 m_val; in optimized C++ the vector construction is all done efficiently in SSE registers.

My first guess is that this multiplication is not optimal. I'm new to SSE, so I'm a bit puzzled about how to optimize this. My intuition tells me to use shuffle instructions, but I'd like to understand why that would be faster. Would loading 4 shuffled __m128 values be faster than assigning them ( __m128 m_val = _mm_set_ps(w, z, y, x); )?
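For reference, the shuffle-based version my intuition points to would look something like this (again a sketch assuming direct access to the four __m128 rows; the names are illustrative):

#include <xmmintrin.h>  // SSE: _mm_shuffle_ps / _mm_mul_ps / _mm_add_ps

static inline __m128 mulVecMatShuffle(__m128 v, __m128 r0, __m128 r1, __m128 r2, __m128 r3)
{
    // Broadcast each component of v with a shuffle, scale the matching matrix row,
    // and accumulate: no transpose and no element-by-element _mm_set_ps gathering.
    __m128 res = _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0,0,0,0)), r0);
    res = _mm_add_ps(res, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1,1,1,1)), r1));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2,2,2,2)), r2));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3,3,3,3)), r3));
    return res;
}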

On https://software.intel.com/sites/landingpage/IntrinsicsGuide/ I couldn't find performance info on _mm_set_ps.

EDIT: I double-checked the profiling method; each test is done in the same manner, so there are no memory cache differences. To avoid local cache effects, I'm doing the operations on a randomized big vector array, with the same seed for each test. Only 1 test per execution, to avoid performance increases from the memory cache.

OpenGL uses column major format for matrices. Well, some entry points allow you to specify if your matrices are row major or column major, but traditionally matrices have been column major. – Reto Koradi
Hi, thanks for replying, I'm shocked :) I read on Stack Overflow (I must find the link again) that OpenGL and DirectX 11 matrices are row major, meaning the translation in a transform matrix is stored in the last 4 elements. opengl.org/archives/resources/faq/technical/transformations.htm: "For programming purposes, OpenGL matrices are 16-value arrays with base vectors laid out contiguously in memory." – MagicFr
Translation is in elements 13, 14, and 15. Which means that it's column major, because they would be in elements 3, 7, and 11 in a row major matrix. Well, this is if you multiply your vectors by writing them as column vectors to the right of the matrix. If you write them as row vectors to the left of the matrix, things switch around. So depending on if you're a row or column vector type of person, the answer changes. ;) – Reto Koradi
Well, sorry for the confusion, I wanted to talk about memory storage. I thought row major was when a row = 1 axis, so I guess I inverted everything. I'm no mathematician, so in memory there's no difference between a column vector and a row vector ;) (still, I know the differences ;) ). I'll edit my question to be extra clear; I will not talk about row/column major but about memory storage. Would that make my problem clearer? – MagicFr
Row major means that if you read the elements in the order they are stored, the elements of the first row are the first 4 elements, the elements of the second row are the next 4, etc. Column major means that the 4 elements of the first column come first, then the 4 elements of the second column, etc. – Reto Koradi
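To make the storage layouts discussed above concrete (an illustrative sketch, column-vector convention, 0-based indices):

// The same affine transform with translation (tx, ty, tz), stored as float[16].
float tx = 1.f, ty = 2.f, tz = 3.f;

// Row-major: rows are contiguous, so the translation lands at indices 3, 7, 11.
float rowMajor[16] = { 1,0,0,tx,   0,1,0,ty,   0,0,1,tz,   0,0,0,1 };

// Column-major: columns are contiguous, so the translation lands at indices 12, 13, 14.
float colMajor[16] = { 1,0,0,0,   0,1,0,0,   0,0,1,0,   tx,ty,tz,1 };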

2 Answers

3 votes

Don't use _mm_dp_ps for matrix multiplication! I already explained this in great detail at Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point? (incidentally this was my first post on SO).

You don't need anything more than SSE to do this efficiently (not even SSE2). Use this code to do 4x4 matrix multiplication efficiently. If the matrices are stored in row-major order then do gemm4x4_SSE(A,B,C). If the matrices are stored in column-major order then do gemm4x4_SSE(B,A,C).

void gemm4x4_SSE(float *A, float *B, float *C) {
    __m128 row[4], sum[4];
    // Load the 4 rows of B once.
    for(int i=0; i<4; i++) row[i] = _mm_load_ps(&B[i*4]);
    for(int i=0; i<4; i++) {
        sum[i] = _mm_setzero_ps();
        for(int j=0; j<4; j++) {
            // Broadcast A[i][j] and accumulate A[i][j] * row_j(B);
            // this builds row i of C = A*B from scaled rows of B.
            sum[i] = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(A[i*4+j]), row[j]), sum[i]);
        }
    }
    for(int i=0; i<4; i++) _mm_store_ps(&C[i*4], sum[i]);
}
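A quick usage sketch (note that _mm_load_ps/_mm_store_ps require 16-byte aligned pointers):

// C = A*B for 4x4 matrices stored as flat float[16] arrays.
alignas(16) float A[16], B[16], C[16];
// ... fill A and B ...
gemm4x4_SSE(A, B, C);   // matrices stored row-major
gemm4x4_SSE(B, A, C);   // matrices stored column-major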
0 votes

We actually profiled 3x4 matrix pseudo-multiplication (as if it were a 4x4 affine) and found that in both SSE3 and AVX there was very little difference (<10%) between the column-major and row-major layouts, as long as both are optimized to the limit.

The benchmark: https://github.com/buildaworldnet/IrrlichtBAW/blob/master/examples_tests/19.SIMDmatrixMultiplication/main.cpp
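To illustrate what is meant by 3x4 pseudo-multiplication (a sketch of the idea only, not the code from the linked benchmark): the fourth row is treated as an implicit (0, 0, 0, 1), so only three output rows are computed.

#include <xmmintrin.h>  // SSE

// A, B, C are 3x4 affine matrices stored row-major as float[12].
void affine3x4_mul(const float* A, const float* B, float* C)
{
    __m128 Brow[4] = {
        _mm_loadu_ps(&B[0]), _mm_loadu_ps(&B[4]), _mm_loadu_ps(&B[8]),
        _mm_set_ps(1.f, 0.f, 0.f, 0.f)   // the implicit (0, 0, 0, 1) fourth row
    };
    for (int i = 0; i < 3; ++i) {
        __m128 sum = _mm_setzero_ps();
        for (int j = 0; j < 4; ++j)
            sum = _mm_add_ps(sum, _mm_mul_ps(_mm_set1_ps(A[i*4 + j]), Brow[j]));
        _mm_storeu_ps(&C[i*4], sum);
    }
}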