22
votes

I'm at a loss to explain (and avoid) the differences in speed between a Matlab mex program and the corresponding C program with no Matlab interface. I've been profiling a numerical analysis program:

    int main(){
        Well_optimized_code();
    }

compiled with gcc 4.4 against the Matlab-Mex equivalent (directed to use gcc44, which is not the version currently supported by Matlab, but it's required for other reasons):

    void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]){
        Well_optimized_code(); // literally the exact same code
    }

I performed the timings as:

$ time ./C_version

vs.

>> tic; mex_version(); toc

The difference in timing is staggering. The version run from the command line takes 5.8 seconds on average. The version in Matlab runs in 21 seconds. For context, the mex file replaces an algorithm in the SimBiology toolbox that takes about 26 seconds to run.

In contrast to Matlab's algorithm, both the C and mex versions scale linearly up to 27 threads using OpenMP, but for the purposes of profiling these calls have been disabled and commented out.

The two versions have been compiled in the same way with the exception of the necessary flags to compile as a mex file: -fPIC --shared -lmex -DMATLAB_MEX_FILE being applied in the mex compilation/linking. I've removed all references to the left and right arguments of the mex file. That is to say it takes no inputs and gives no outputs, it is solely for profiling.

The Great and Glorious Google has informed me that the position independent code should not be the source of the slowdown and beyond that I'm at a loss.

Any help will be appreciated,

Andrew

2
One initial guess might be that optimizations that apply to the executable are not being applied to the shared library. How about having your executable call the MEX function instead of including the code itself? That might help isolate where the performance bottleneck is. --Pablo
@Pablo I'm not sure what you mean. How would I get the executable to call the mex function without being inside Matlab? --Sevenless
A MEX file is just a shared library (.dll or .so) that exports a well-known function, namely the mexFunction. You can make it so your executable loads the shared library and calls mexFunction in it. That way, the code you run for Well_optimized_code() should be identical. --Pablo
It's plausible that the memory allocator under Matlab is behaving differently to that in the standalone. Can you modify the optimized code to use memory differently? Also, does the slowdown happen all the time you use the function, or just the first time? --Alex
@Alex It happens all the time. The timings I reported are from the calls after the first. While the first call appears to be slower on average, it is not appreciably so. Thanks for the thought. --Sevenless
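
The suggestion in the comments above, loading the MEX file from a standalone executable and calling mexFunction directly, could look roughly like the sketch below. It is only an illustration under stated assumptions: `time_mex_entry` is a hypothetical helper name, the opaque `mxArray` declaration stands in for the real one from mex.h, and the MEX file is assumed to ignore its arguments, as in the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

/* Opaque stand-in for Matlab's mxArray so this file builds without mex.h.
   Assumption: the MEX file ignores its arguments, as in the question. */
typedef struct mxArray_tag mxArray;
typedef void (*mex_entry_fn)(int, mxArray *[], int, const mxArray *[]);

/* Hypothetical helper: loads the shared library at `path`, calls its
   mexFunction once with no arguments, and stores the elapsed wall-clock
   time in *seconds. Returns 0 on success, -1 if the library or the
   symbol cannot be loaded (in which case *seconds is left untouched). */
int time_mex_entry(const char *path, double *seconds)
{
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlopen failed: %s\n", err ? err : "unknown error");
        return -1;
    }

    /* POSIX idiom for converting the void* from dlsym to a function pointer. */
    mex_entry_fn entry;
    *(void **)(&entry) = dlsym(handle, "mexFunction");
    if (entry == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlsym failed: %s\n", err ? err : "unknown error");
        dlclose(handle);
        return -1;
    }

    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    entry(0, NULL, 0, NULL);   /* no left-hand or right-hand arguments */
    clock_gettime(CLOCK_MONOTONIC, &stop);

    *seconds = (double)(stop.tv_sec - start.tv_sec)
             + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
    dlclose(handle);
    return 0;
}
```

Called from a small main with the path to the compiled MEX file, this times the same shared object that Matlab loads, which separates "the code was built as a shared library" from "the code is running inside Matlab". On some systems the program must be linked with -ldl, and because the MEX file is linked against libmex, the dynamic loader has to be able to find Matlab's libraries. A MEX file that actually calls into the Matlab API may not run correctly outside Matlab.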

2 Answers

14
votes

After a month of emailing with my contacts at Mathworks, playing around with my own code, and profiling my code every which way, I have an answer; however, it may be the most dissatisfying answer I have ever had to a technical question:

The short version is "upgrade to Matlab version 2011a (officially released last week), this issue has now been resolved".

The longer version concerns overhead associated with the mex gateway in versions 2010b and earlier. The best explanation that I've been able to extract is that this overhead is not paid once; rather, we pay a little bit every time a function calls another function that lives in a linked library.

Why this occurs baffles me, but it is at least consistent with the Shark profiling that I did. When I profile and compare the native app and the mex app, there is a recurring pattern. The time spent in functions that are in the source code I wrote for the app does not change. The time spent in library functions increases a little when going from the native to the mex implementation. The time spent in functions from another library, one that this library is itself built on, increases a lot. The time difference continues to grow as we proceed ever deeper until we reach my BLAS implementation.

A couple of heavily used BLAS functions were the main culprits. A function that took ~1% of my computation time in the native app was clocking in at 30% in the mex function.
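
One way to check this kind of per-call overhead without a full profiler is to time many calls to a single small routine and compare the average cost in the native build against the MEX build. The following is a minimal sketch, not the measurement I actually ran: `ns_per_call` and `stand_in_routine` are hypothetical names, and the stand-in would be replaced by a real routine from the linked library.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for a small library routine under test. */
typedef double (*unary_fn)(double);

/* Hypothetical stand-in; in a real comparison this would be a routine
   that lives in the linked library (for example, a thin BLAS wrapper). */
double stand_in_routine(double x)
{
    return x * 0.5;
}

/* Calls fn `calls` times through a function pointer and returns the
   average wall-clock cost per call in nanoseconds (0.0 if calls == 0).
   The running sum is volatile so the compiler cannot discard the calls. */
double ns_per_call(unary_fn fn, size_t calls)
{
    volatile double sink = 0.0;
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < calls; ++i) {
        sink += fn((double)i);
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);

    double elapsed_ns = (double)(stop.tv_sec - start.tv_sec) * 1e9
                      + (double)(stop.tv_nsec - start.tv_nsec);
    return calls > 0 ? elapsed_ns / (double)calls : 0.0;
}
```

Running the same measurement from main() and from mexFunction() gives two per-call figures for the identical routine. If the overhead described above is present, the MEX figure should be noticeably larger for routines deep in the library chain.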

The implementation of the mex gateway appears to have changed between 2010b and 2011a. On my macbook the native app takes about 6 seconds and the mex version takes 6.5 seconds. This is overhead that I can deal with.

As for the underlying cause, I can only speculate. Matlab has its roots in interpreted code. Since mex functions are dynamic libraries, I'm guessing that each mex library was unaware of what it was linked against until runtime. Since Matlab suggests that users rarely use mex, and then only for small computationally intensive chunks, I assume that large programs (such as an ODE solver) are rarely implemented this way. These programs, like mine, are the ones that suffer the most.

I've profiled a couple of Matlab functions that I know to be implemented in C and then compiled using mex (especially sbiosimulate after calling sbioaccelerate on kinetic models, part of the SimBiology toolbox), and there appear to be some significant speedups. So the 2011a update appears to be more broadly beneficial than the usual semi-yearly upgrade.

Best of luck to other coders with similar issues. Thanks for all of the helpful advice that got me started in the right direction.

--Andrew

3
votes

Recall that Matlab stores arrays in column-major order, while C/C++ arrays are row-major. Is it possible that your loop structure/algorithm is iterating in a row-major fashion, resulting in poor memory access times in Matlab, but fast access times in C/C++?
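
To make the layout point concrete, here is a small sketch of indexing and traversing a column-major buffer from C. The function names are made up for illustration, and this only matters if the code reads data that Matlab allocated, which the profiling-only version in the question does not.

```c
#include <stddef.h>

/* Element (row, col) of a matrix with `rows` rows stored column-major,
   the layout Matlab uses for its numeric arrays. */
double at_column_major(const double *a, size_t rows, size_t row, size_t col)
{
    return a[row + col * rows];
}

/* Sums every element with the row index in the inner loop, so consecutive
   iterations touch adjacent memory in a column-major buffer. Swapping the
   two loops gives the same result but strides through memory by `rows`. */
double sum_column_major(const double *a, size_t rows, size_t cols)
{
    double total = 0.0;
    for (size_t col = 0; col < cols; ++col) {
        for (size_t row = 0; row < rows; ++row) {
            total += a[row + col * rows];
        }
    }
    return total;
}
```

For a 2-by-3 matrix stored as {1, 2, 3, 4, 5, 6}, the columns are {1, 2}, {3, 4} and {5, 6}, so element (row 1, column 2) sits at offset 1 + 2*2 = 5.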