3 votes

In Matlab we use vectorization to speed up code. For example, here are two ways of performing the same calculation:

% Loop
tic
i = 0;
for t = 0:.01:1e5
    i = i + 1;
    y(i) = sin(t);
end
toc

% Vectorization
tic
t = 0:.01:1e5;
y = sin(t);
toc

The results are:

Elapsed time is 1.278207 seconds. % For loop
Elapsed time is 0.099234 seconds. % Vectorization

So the vectorized code is almost 13 times faster. Actually, if we run it again we get:

Elapsed time is 0.200800 seconds. % For loop
Elapsed time is 0.103183 seconds. % Vectorization

The vectorized code is now only about 2 times as fast instead of 13 times as fast. So it appears we get a huge speedup on the first run, but on subsequent runs the speedup is smaller, since Matlab seems to recognize that the for loop hasn't changed and optimizes it. In any case, the vectorized code is still twice as fast as the for loop code.

Now I have started using C++ and I am wondering about vectorization in this language. Do we need to vectorize for loops in C++, or are they already fast enough? Maybe the compiler automatically vectorizes them? Actually, I don't know if Matlab-style vectorization is even a concept in C++; maybe it's only needed in Matlab because it is an interpreted language? How would you write the above function in C++ to make it as efficient as possible?
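For concreteness, the straightforward translation I would start from is something like the following (just a sketch, with the output preallocated):

#include <cmath>
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = static_cast<std::size_t>(1e5 / 0.01) + 1;
    std::vector<double> y(n);  // preallocated output, unlike the Matlab loop
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = std::sin(0.01 * static_cast<double>(i));
    }
}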

You may want to check and measure the results of a simple loop vs. an implementation using std::valarray. – πάντα ῥεῖ
Depending on the hardware and use case, vectorization gives you between a 1x and 16x speedup. This is not insignificant, but you have to invest a lot of effort in your implementation if you can't fall back on vectorized libraries. Therefore the answer to "do you need vectorization in C++?" is "it depends...". – OutOfBound
According to MathWorks, large parts of the core code in the Matlab interpreter itself are written in C, and most of the code that does parallelisation is written in C++. (And the user interface in modern versions of Matlab is written in Java.) So, in that sense, the performance of Matlab when acting on matrices demonstrates what is possible to achieve with C++, given enough dedicated developer effort. A hand-written loop tends to be slower in Matlab because more work is done in the interpreter, rather than in code in C++ (or other languages) that is hand-crafted for performance on a matrix. – Peter
@πάνταῥεῖ See Why is valarray so slow?. No one cares about optimizing std::valarray. – phuclv
@phuclv Never used it, just remembered the "equivalence" of functionality. – πάντα ῥεῖ
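For reference, the std::valarray version suggested in the first comment might look like the following sketch (as the later comments note, it is not necessarily faster than a plain loop):

#include <cstddef>
#include <valarray>

int main() {
    const std::size_t n = static_cast<std::size_t>(1e5 / 0.01) + 1;
    std::valarray<double> t(n);
    for (std::size_t i = 0; i < n; ++i) {
        t[i] = 0.01 * static_cast<double>(i);  // t = 0:.01:1e5
    }
    std::valarray<double> y = std::sin(t);     // elementwise sin from <valarray>
}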

1 Answer

5 votes

Do we need vectorization in C++

Vectorisation is not always necessary, but it can make some programs faster.

C++ compilers support auto-vectorisation, but if your program depends on vectorisation for its performance, you may not be able to rely on that optimisation alone, because not every loop can be vectorised automatically.

are [loops] already fast enough?

That depends on the loop, the target CPU, the compiler and its options, and, crucially, on how fast it needs to be.


Some things that you could do to potentially achieve vectorisation in standard C++:

  • Enable compiler optimisations that perform auto vectorisation. (See the manual of your compiler)
  • Specify a target CPU that has vector operations in their instruction set. (See the manual of your compiler)
  • Use standard algorithms with the std::execution::par_unseq (std::execution::parallel_unsequenced_policy) or std::execution::unseq (std::execution::unsequenced_policy) execution policies; a sketch follows after this list.
  • Ensure that the data being operated on is sufficiently aligned for SIMD instructions. You can use alignas. See the manual of the target CPU for what alignment you need.
  • Ensure that the optimiser knows as much as possible by using link time optimisation.
  • Partially unroll your loops. A limitation of this is that you hard-code the amount of parallelisation:
for (int i = 0; i < count; i += 4) {  // assumes count is a multiple of 4
    operation(i + 0);
    operation(i + 1);
    operation(i + 2);
    operation(i + 3);
}
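As a sketch of the execution-policy point above, assuming a standard library that implements the C++20 std::execution::unseq overloads of the parallel algorithms (and compiled with optimisations, e.g. -O3 -march=native on GCC/Clang), the question's computation could be written as:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <execution>
#include <vector>

int main() {
    const std::size_t n = static_cast<std::size_t>(1e5 / 0.01) + 1;
    std::vector<double> t(n), y(n);

    // t = 0, 0.01, 0.02, ..., 1e5 (the equivalent of t = 0:.01:1e5)
    for (std::size_t i = 0; i < n; ++i) {
        t[i] = 0.01 * static_cast<double>(i);
    }

    // The unsequenced policy tells the library that iterations have no
    // ordering dependencies, so the loop body may be vectorised.
    std::transform(std::execution::unseq, t.begin(), t.end(), y.begin(),
                   [](double x) { return std::sin(x); });
}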

Outside of standard, portable C++, there are implementation-specific ways:

  • Some compilers provide language extensions for writing explicitly vectorised programs (the example below uses GCC's vector extensions). These are portable across different CPUs, but not to compilers that don't implement the extension.
using v4si = int __attribute__ ((vector_size (16)));
v4si a, b, c;
a = b + 1;    /* a = b + {1,1,1,1}; */
a = 2 * b;    /* a = {2,2,2,2} * b; */
  • Some compilers provide "builtin" functions to invoke specific CPU instructions which can be used to invoke SIMD vector instructions. Using these is not portable across incompatible CPUs.
  • Some compilers support OpenMP API which has #pragma omp simd.
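For the "builtin"/intrinsic point above, a minimal sketch assuming an x86 target with SSE support (the function add4 is made up for illustration, and count is assumed to be a multiple of 4):

#include <immintrin.h>

// Adds two float arrays four elements at a time using SSE intrinsics.
void add4(const float* a, const float* b, float* out, int count) {
    for (int i = 0; i < count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // load 4 floats (unaligned)
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vr = _mm_add_ps(va, vb);   // 4 additions in one instruction
        _mm_storeu_ps(out + i, vr);       // store 4 results
    }
}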
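And for the OpenMP point, a sketch of the question's loop annotated with #pragma omp simd, assuming the compiler is invoked with OpenMP enabled (e.g. -fopenmp on GCC/Clang); the function name compute_sin is made up for illustration:

#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> compute_sin(std::size_t n) {
    std::vector<double> y(n);
    // Asks the compiler to vectorise this loop; there are no
    // loop-carried dependencies, so vectorisation is safe here.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = std::sin(0.01 * static_cast<double>(i));
    }
    return y;
}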