A few examples can be used to study how for-loops and vectorization compare on performance.
Example #1
This is a very basic computation: calculating the sine of each element of an array. The number of elements was varied to see how the comparison changes with problem size. Inspired by this screenshot link.
Benchmarking Code
num_runs = 1000;
N_arr = [1000 10000 100000 1000000];

%// Warm up tic/toc.
for k = 1:100
    tic(); elapsed = toc();
end

for k = 1:numel(N_arr)
    N = N_arr(k);

    %// For-loop version, averaged over num_runs runs.
    tic
    for runs = 1:num_runs
        out_f1 = zeros(1,N);
        for t = 1:N
            out_f1(t) = sin(t);
        end
    end
    t_forloop = toc/num_runs;

    %// Vectorized version, averaged over num_runs runs.
    tic
    for runs = 1:num_runs
        out_v1 = sin(1:N);
    end
    t_vect = toc/num_runs;
end
Results
----------- Datsize(N) = 1000 -------------
Elapsed time with for-loops - 7.1826e-05
Elapsed time with vectorized code - 8.3601e-05
----------- Datsize(N) = 10000 -------------
Elapsed time with for-loops - 0.00068531
Elapsed time with vectorized code - 0.00045043
----------- Datsize(N) = 100000 -------------
Elapsed time with for-loops - 0.0074613
Elapsed time with vectorized code - 0.0053368
----------- Datsize(N) = 1000000 -------------
Elapsed time with for-loops - 0.077707
Elapsed time with vectorized code - 0.053255
All times are averages per run, in seconds. Please note that these results were consistent with timeit results (code and results of those aren't shown here).
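Since the timeit code isn't shown, here is a minimal sketch of what such a cross-check could look like. It is not the code used for the results above: the size N and the helper sine_loop are assumptions made for illustration, and local functions inside a script need MATLAB R2016b or later.

```matlab
% Sketch only: time both variants of Example #1 with timeit.
N = 10000;                    % hypothetical problem size

f_loop = @() sine_loop(N);    % sine_loop is a hypothetical helper (see below)
f_vect = @() sin(1:N);

t_forloop = timeit(f_loop);   % typical time per call, in seconds
t_vect    = timeit(f_vect);

fprintf('Elapsed time with for-loops - %g\n', t_forloop);
fprintf('Elapsed time with vectorized code - %g\n', t_vect);

function out = sine_loop(N)
    % Same loop as in the tic/toc benchmark, wrapped in a function
    % so that it can be passed to timeit as a function handle.
    out = zeros(1, N);
    for t = 1:N
        out(t) = sin(t);
    end
end
```

timeit calls the handle several times and handles warm-up itself, so the manual tic/toc warm-up loop isn't needed in this variant.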
Conclusions
- The results show that the vectorized code is already faster at 10000 elements; the for-loop was slightly faster only for the 1000 elements case.
Example #2
Let's consider a case where an array of elements is assigned inside each iteration of the for-loop. Each iteration stores sine, cosine, tangent and secant into one column, i.e. [sin(t) ; cos(t) ; tan(t) ; sec(t)].
The for-loop code would be -
out_f1 = zeros(4,N);
for t = 1:N
    out_f1(:,t) = [sin(t) ; cos(t) ; tan(t) ; sec(t)];
end
Vectorized code -
out_v1 = [sin(1:N); cos(1:N) ; tan(1:N); sec(1:N)];
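Before comparing timings, it can be worth confirming that the two versions produce the same 4-by-N result. The following is a small sketch of such a check; the size N and the names same_size and max_rel_err are assumptions, and a tolerance is used instead of exact equality in case scalar and array calls round differently.

```matlab
% Sketch only: check that the loop and vectorized versions agree.
N = 1000;                     % hypothetical size, small enough for a quick check

out_f1 = zeros(4, N);
for t = 1:N
    out_f1(:, t) = [sin(t); cos(t); tan(t); sec(t)];
end

out_v1 = [sin(1:N); cos(1:N); tan(1:N); sec(1:N)];

% Both should be 4-by-N.
same_size = isequal(size(out_f1), size(out_v1));

% Largest difference, scaled by magnitude because tan and sec can get large.
max_rel_err = max(abs(out_f1(:) - out_v1(:)) ./ max(abs(out_f1(:)), 1));

fprintf('Same size: %d, max relative difference: %g\n', same_size, max_rel_err);
```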
Results
----------- Datsize(N) = 100 -------------
Elapsed time with for-loops - 0.00011861
Elapsed time with vectorized code - 6.0569e-05
----------- Datsize(N) = 1000 -------------
Elapsed time with for-loops - 0.0011867
Elapsed time with vectorized code - 0.00036786
----------- Datsize(N) = 10000 -------------
Elapsed time with for-loops - 0.011819
Elapsed time with vectorized code - 0.0025536
----------- Datsize(N) = 1000000 -------------
Elapsed time with for-loops - 1.2329
Elapsed time with vectorized code - 0.33383
Modified case
One could easily jump to the conclusion that the for-loop doesn't stand a chance here. But wait, how about doing element-wise assignment again for the for-loop case, as in example #1, like this -
out_f1 = zeros(4,N);
for t = 1:N
    out_f1(1,t) = sin(t);
    out_f1(2,t) = cos(t);
    out_f1(3,t) = tan(t);
    out_f1(4,t) = sec(t);
end
Now, this writes to consecutive memory locations within each iteration and so makes use of spatial locality. A competitive vectorized code using the same idea would be -
out_v1 = [sin(1:N) cos(1:N) tan(1:N) sec(1:N)]';
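One detail to keep in mind when reading the timings for this modified case: the vectorized line above concatenates the four rows horizontally and transposes, so it returns a 4N-by-1 column vector, whereas the loop fills a 4-by-N matrix. The sketch below shows the difference on a small, assumed N, together with one possible way (reshape plus transpose, not part of the benchmark) to get back the 4-by-N layout; the name out_v1_mat is hypothetical.

```matlab
% Sketch only: compare output shapes of the modified loop and vectorized code.
N = 5;                        % hypothetical small size

out_f1 = zeros(4, N);
for t = 1:N
    out_f1(1, t) = sin(t);
    out_f1(2, t) = cos(t);
    out_f1(3, t) = tan(t);
    out_f1(4, t) = sec(t);
end

out_v1 = [sin(1:N) cos(1:N) tan(1:N) sec(1:N)]';

disp(size(out_f1));           % 4 5
disp(size(out_v1));           % 20 1

% Possible conversion to the loop's layout: N-by-4 blocks, then transpose.
out_v1_mat = reshape(out_v1, N, 4).';
max_diff = max(abs(out_v1_mat(:) - out_f1(:)));
```

The extra reshape and transpose take time of their own, so the benchmark numbers for the vectorized code apply to the column-vector output only.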
The benchmark results with this modified code for this test case were -
----------- Datsize(N) = 100 -------------
Elapsed time with for-loops - 3.1987e-05
Elapsed time with vectorized code - 6.9778e-05
----------- Datsize(N) = 1000 -------------
Elapsed time with for-loops - 0.00027976
Elapsed time with vectorized code - 0.00036804
----------- Datsize(N) = 10000 -------------
Elapsed time with for-loops - 0.0029712
Elapsed time with vectorized code - 0.0024423
----------- Datsize(N) = 100000 -------------
Elapsed time with for-loops - 0.031113
Elapsed time with vectorized code - 0.028549
----------- Datsize(N) = 1000000 -------------
Elapsed time with for-loops - 0.32636
Elapsed time with vectorized code - 0.28063
Conclusions
The latter benchmark results again suggest that below 10000 elements the for-loop wins, and that from 10000 elements onwards the vectorized solution is preferable. But it must be noted that this came at the expense of writing out element-wise assignments.
Final Conclusions
- On the question of which side (for-loop or vectorization) is better, the picture seems to be far from black and white.