I have to optimize a piece of MATLAB code. The code is simple, but it is part of a calculation unit that calls it ~8000 times (without redundancy), and this calculation unit is itself used ~10-20K times in real cases. The whole MATLAB code is quite long and complex (for a physicist like me), yet the MATLAB profiler claims that the following segment is responsible for nearly half the run time (!).
The code essentially multiplies, elementwise, every combination of 3 matrices, one from each of 3 groups (A, B, C), and sums the products with some weighting. Group A has a single matrix, group B has 4 matrices and group C has 7 matrices for each B matrix (28 in total).
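Written out, the weighted sum looks roughly like the expression below. The notation is introduced here only for illustration: \circ is the elementwise product, k^{x}_{b,c} stands for the scalar Coeff_x of a given (idx_b, idx_c) pair, and S_x^{(0)} is whatever value Sum_x holds before the loops. The second form follows only because scalars and elementwise products distribute over addition; it is an algebraic observation, not a measured speed-up.

```latex
\[
S_x \;=\; S_x^{(0)} + \sum_{b=1}^{4}\sum_{c=1}^{7} k^{x}_{b,c}\,\bigl(A \circ C_{b,c} \circ B_b\bigr)
\;=\; S_x^{(0)} + A \circ \sum_{b=1}^{4} B_b \circ \Bigl(\sum_{c=1}^{7} k^{x}_{b,c}\, C_{b,c}\Bigr)
\]
```

The same expression holds for Sum_y and Sum_z with their own coefficients.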
I tried some vectorization techniques (see "Vectorization attempt" below), yet at best got the same run time.
Using the MATLAB profiler, I checked the total time spent on each line (over all 8000 calls); the times are written in the comments.
for idx_b = 1:4
    B_MAT = B_Container_Cell{idx_b};
    for idx_c = 1:7
        C_MAT = C_Container_Cell{idx_b}(:,:,idx_c); % 60 sec
        ACB = A_MAT.*C_MAT.*B_MAT; % 20 sec
        Coeff_x = Coeff_x_Cell{idx_b}(p1,p2,idx_c,p3);
        Coeff_y = Coeff_y_Cell{idx_b}(p1,p2,idx_c,p3);
        Coeff_z = Coeff_z_Cell{idx_b}(p1,p2,idx_c,p3);
        Sum_x = Sum_x + Coeff_x.*ACB; % 15 sec
        Sum_y = Sum_y + Coeff_y.*ACB; % 15 sec
        Sum_z = Sum_z + Coeff_z.*ACB; % 15 sec
    end
end
Some prior knowledge -
A_MAT is a 1024x1024 complex double constant matrix defined outside the loop
B_MAT is a 1024x1024 double matrix, essentially sparse (only 0 and 1 values; the ones are ~5% of all elements)
C_MAT is a 1024x1024 complex double matrix
Sum_x / Sum_y / Sum_z were properly initialized
Coeff_x / Coeff_y / Coeff_z are double scalars
p1,p2,p3 are parameters (constant for this code segment)
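Since B_MAT is ~95% zeros, one idea is to do the arithmetic only at its nonzero positions. The following is a minimal sketch for a single idx_b, not a measured improvement: the matrix size is reduced for illustration, and C_STACK and coeff_x are hypothetical stand-ins for C_Container_Cell{idx_b} and the seven Coeff_x scalars.

```matlab
rng(0);                                   % reproducible sample data
N = 64; nC = 7;                           % small N for illustration (1024 in the real case)
A_MAT   = complex(randn(N), randn(N));
B_MAT   = double(rand(N) < 0.05);         % ~5% ones, the rest zeros
C_STACK = complex(randn(N,N,nC), randn(N,N,nC));  % stand-in for C_Container_Cell{idx_b}
coeff_x = randn(nC,1);                    % stand-in for the seven Coeff_x scalars

% Reference: the straightforward loop over idx_c
Sum_ref = zeros(N);
for idx_c = 1:nC
    Sum_ref = Sum_ref + coeff_x(idx_c) .* (A_MAT .* C_STACK(:,:,idx_c) .* B_MAT);
end

% Sparse-aware version: touch only the positions where B_MAT is nonzero
nz       = find(B_MAT);                   % linear indices of the nonzeros
C_2D     = reshape(C_STACK, N*N, nC);     % one column per C matrix
weighted = C_2D(nz,:) * coeff_x;          % weighted sum over idx_c at those positions
Sum_x     = zeros(N);
Sum_x(nz) = A_MAT(nz) .* B_MAT(nz) .* weighted;

max_err = max(abs(Sum_x(:) - Sum_ref(:)));  % should be at rounding level
```

Whether this is faster in practice depends on how expensive the indexed gather C_2D(nz,:) is compared with the full-size multiplications, so it would need profiling on the real data.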
Does anybody know why the most time-consuming operation is a variable assignment? (I tried skipping the assignment and replacing C_MAT directly with its expression, yet that worsened the performance.)
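One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: the line is not a plain assignment, because indexing a slice out of a 3-D array creates a new 1024x1024 complex array each time, which means copying 16 MiB per iteration. A small sketch to measure that cost in isolation, assuming timeit is available (C_STACK is a hypothetical stand-in for one entry of C_Container_Cell):

```matlab
N = 1024; nC = 7;
C_STACK = complex(randn(N,N,nC), randn(N,N,nC));  % stand-in for C_Container_Cell{idx_b}

slice_bytes = N * N * 16;                 % a complex double takes 16 bytes per element
t_slice = timeit(@() C_STACK(:,:,3));     % time to extract (copy) one slice

fprintf('slice size: %.0f MiB, extraction time: %.3g ms\n', ...
        slice_bytes / 2^20, 1e3 * t_slice);
```

Multiplying the measured time by 28 slices and ~8000 calls gives a rough figure to compare with the 60 sec reported by the profiler.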
Vectorization attempt
The technique I tried is to use cat, reshape and repmat to create 3 giant 2D matrices, multiply those elementwise, then stack the results on top of each other (with reshape) and sum along the relevant dimension. The first matrix was A repeated 4*7=28 times, the second was the 4 B matrices, each repeated 7 times, and the third was all the C matrices laid out side by side (=28 matrices).
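For concreteness, here is a sketch of this kind of vectorization. It is a 3-D variant (stacking along the third dimension instead of building giant 2D matrices), so it is not the exact code I ran; sizes are reduced, and coeff_x is a hypothetical 7x4 array standing in for the Coeff_x_Cell{idx_b}(p1,p2,idx_c,p3) values.

```matlab
rng(0);
N = 16; nB = 4; nC = 7;                   % small N for illustration
A_MAT = complex(randn(N), randn(N));
B_Container_Cell = cell(1,nB);
C_Container_Cell = cell(1,nB);
for idx_b = 1:nB
    B_Container_Cell{idx_b} = double(rand(N) < 0.05);
    C_Container_Cell{idx_b} = complex(randn(N,N,nC), randn(N,N,nC));
end
coeff_x = randn(nC,nB);                   % stand-in: coeff_x(idx_c,idx_b)

% Reference: the original double loop (Sum_x only)
Sum_ref = zeros(N);
for idx_b = 1:nB
    for idx_c = 1:nC
        ACB = A_MAT .* C_Container_Cell{idx_b}(:,:,idx_c) .* B_Container_Cell{idx_b};
        Sum_ref = Sum_ref + coeff_x(idx_c,idx_b) .* ACB;
    end
end

% Vectorized: build three N x N x 28 stacks, multiply, weight, and sum
C_all = cat(3, C_Container_Cell{:});            % idx_c runs fastest within each idx_b
B_3d  = cat(3, B_Container_Cell{:});            % N x N x 4
B_all = B_3d(:,:,kron(1:nB, ones(1,nC)));       % each B repeated nC times
A_all = repmat(A_MAT, [1 1 nB*nC]);             % A repeated 28 times
w     = reshape(coeff_x, 1, 1, []);             % 1 x 1 x 28, same ordering as C_all
Sum_x = sum(bsxfun(@times, A_all .* B_all .* C_all, w), 3);

max_err = max(abs(Sum_x(:) - Sum_ref(:)));      % should be at rounding level
```

The stacks hold 28 full-size matrices each, which at 1024x1024 complex double is several hundred MiB per stack; that memory traffic may be why this kind of approach did not beat the loop.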
Sample Input
The code at the following link generates sample input files. The run time with these variables (on my computer) is ~0.38 sec, compared with ~0.42 sec for the original code and variables. In my opinion the difference arises because the real C cell container is very large, so extraction takes more time.
.* operation is commutative (unlike * matrix multiply), so you can perform A_MAT*B_MAT outside the inner loop. - Ben Voigt
B_Container_Cell could be replaced by 1024x1024x4 array, C_Container_Cell by 1024x1024x7x4 and similarly 4D arrays for Coeff_x_Cell, Coeff_y_Cell and Coeff_z_Cell each. Wouldn't that work? - Divakar
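The two suggestions in the comments can be combined in a short sketch. This is an illustration under assumptions, not tested on the real data: B_ARR, C_ARR and coeff_x/coeff_y/coeff_z are hypothetical array replacements for the cell containers (the coefficients are shown already reduced to 7x4 for fixed p1, p2, p3), and the product hoisted out of the inner loop is the elementwise A_MAT.*B_MAT.

```matlab
rng(0);
N = 32; nB = 4; nC = 7;                   % small N for illustration (1024 in the real case)
A_MAT = complex(randn(N), randn(N));
B_ARR = double(rand(N,N,nB) < 0.05);                    % replaces B_Container_Cell
C_ARR = complex(randn(N,N,nC,nB), randn(N,N,nC,nB));    % replaces C_Container_Cell
coeff_x = randn(nC,nB); coeff_y = randn(nC,nB); coeff_z = randn(nC,nB);

Sum_x = zeros(N); Sum_y = zeros(N); Sum_z = zeros(N);
for idx_b = 1:nB
    AB = A_MAT .* B_ARR(:,:,idx_b);       % hoisted: computed once per idx_b, not 7 times
    for idx_c = 1:nC
        ACB = AB .* C_ARR(:,:,idx_c,idx_b);
        Sum_x = Sum_x + coeff_x(idx_c,idx_b) .* ACB;
        Sum_y = Sum_y + coeff_y(idx_c,idx_b) .* ACB;
        Sum_z = Sum_z + coeff_z(idx_c,idx_b) .* ACB;
    end
end

% Check Sum_x against the original operation order (A .* C .* B inside the inner loop)
Sum_ref = zeros(N);
for idx_b = 1:nB
    for idx_c = 1:nC
        Sum_ref = Sum_ref + coeff_x(idx_c,idx_b) .* ...
                  (A_MAT .* C_ARR(:,:,idx_c,idx_b) .* B_ARR(:,:,idx_b));
    end
end
max_err = max(abs(Sum_x(:) - Sum_ref(:)));  % should be at rounding level
```

Hoisting removes one full-size elementwise multiplication per inner iteration. Switching from cells to arrays mainly simplifies the indexing; extracting C_ARR(:,:,idx_c,idx_b) still produces a new matrix each time, so it may not remove the cost of that line by itself.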