3
votes

In Python, it is usually recommended to vectorize code to make computation faster. For example, to compute the inner product of two vectors a and b, usually

Code A

c = np.dot(a, b)

is better than

Code B

c = 0
for i in range(len(a)):
    c += a[i] * b[i]

But in Julia, it seems that vectorization is sometimes not that helpful. I treated '* and dot as vectorized versions and an explicit for loop as the non-vectorized version, and got the following result.

using Random
using LinearAlgebra

len = 1000000
rng1 = MersenneTwister(1234)
a = rand(rng1, len)
rng2 = MersenneTwister(12345)
b = rand(rng2, len)


function inner_product(a, b)
    result = 0
    for i in 1: length(a) 
        result += a[i] * b[i] 
    end
    return result
end


@time a' * b
@time dot(a, b)
@time inner_product(a, b);
  0.013442 seconds (56.76 k allocations: 3.167 MiB)
  0.003484 seconds (106 allocations: 6.688 KiB)
  0.008489 seconds (17.52 k allocations: 976.752 KiB)

(I know using BenchmarkTools.jl is a more standard way to measure the performance.)

From the output, dot runs faster than the for loop, which in turn runs faster than '*. This contradicts what was presumed above.

So my questions are:

  1. Does Julia need (or sometimes need) vectorization to speed up computation?
  2. If it does, when should vectorization be used, and which is the better way to use it (dot or '*)?
  3. If it does not, what is the difference between Julia and Python in the way vectorized and non-vectorized code is executed?

2 Answers

3
votes

You are not benchmarking correctly, and the implementation of your function is suboptimal.

julia> using BenchmarkTools                   
                                              
julia> @btime $a' * $b                        
  429.400 μs (0 allocations: 0 bytes)         
249985.3680190253                             
                                              
julia> @btime dot($a,$b)                      
  426.299 μs (0 allocations: 0 bytes)         
249985.3680190253                             
                                              
julia> @btime inner_product($a, $b)           
  970.500 μs (0 allocations: 0 bytes)         
249985.36801903677                            

The correct implementation:

function inner_product_correct(a, b)
    result = 0.0 # use the same element type as the arguments
    @simd for i in 1:length(a)
        @inbounds result += a[i] * b[i]
    end
    return result
end

julia> @btime inner_product_correct($a, $b)
  530.499 μs (0 allocations: 0 bytes)
249985.36801902478

There is still a difference (though a less significant one) because dot uses the optimized BLAS implementation, which is multi-threaded. You could (following Bogumil's comment) set OPENBLAS_NUM_THREADS=1, and then you will find that the timing of the BLAS function is essentially identical to the Julia implementation.
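
If you prefer not to restart Julia with the environment variable set, here is a minimal sketch of the same experiment from inside the session (reusing a and b from the question); LinearAlgebra exposes the BLAS thread count directly:

using LinearAlgebra, BenchmarkTools

BLAS.get_num_threads()  # how many threads BLAS currently uses (Julia >= 1.6)
BLAS.set_num_threads(1) # force single-threaded BLAS

@btime dot($a, $b)      # should now be close to inner_product_correct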

Note also that working with floating-point numbers is tricky in many ways:

julia> inner_product_correct(a, b)==dot(a,b)
false

julia> inner_product_correct(a, b) ≈ dot(a,b)
true 

Finally, in Julia deciding whether to use vectorized functions or to write the loop yourself is up to you - there is no performance penalty (as long as you write type-stable code and use @simd and @inbounds where required). However, in your code you were not really testing vectorization; you were comparing calling BLAS against writing the loop yourself. Here is the must-read to understand what is going on: https://docs.julialang.org/en/v1/manual/performance-tips/
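
As a small follow-up on the type-stability point, here is a minimal sketch (reusing the two functions above) of how to check it yourself:

using InteractiveUtils # for @code_warntype outside the REPL

# In inner_product the accumulator starts as the integer 0 and becomes a
# Float64 inside the loop, so the compiler has to handle a Union type;
# in inner_product_correct it stays Float64 throughout. A more generic fix
# than hard-coding 0.0 is result = zero(eltype(a)).
@code_warntype inner_product(a, b)          # accumulator flagged as Union{Float64, Int64}
@code_warntype inner_product_correct(a, b)  # accumulator is a plain Float64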

3
votes

Let me add my practical experience (it is too long for a standard comment):

does Julia need (or sometimes need) vectorization to speed up computation?

Julia does not need vectorization the way Python does (see the answer by Przemysław), but in practice, if a well-written vectorized function (like dot) is available, use it: while possible, it can be tricky to write an equally performant function yourself (people have probably spent days optimizing dot, especially to make optimal use of multiple threads).
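
To give a feel for why matching dot is non-trivial, here is a rough sketch (just an illustration, not a tuned implementation; the name inner_product_threaded is made up for this example) of a hand-rolled multi-threaded inner product. Even with @simd and @inbounds it will typically still not beat a well-tuned BLAS kernel:

using Base.Threads

# Split the index range into one chunk per thread, accumulate per-chunk
# partial sums, then reduce them. Assumes a and b are Float64 vectors as in
# the question; start Julia with e.g. `julia -t 4` to actually use threads.
function inner_product_threaded(a, b)
    n  = length(a)
    nt = nthreads()
    partials = zeros(eltype(a), nt)
    @threads for t in 1:nt
        lo = div((t - 1) * n, nt) + 1
        hi = div(t * n, nt)
        s  = zero(eltype(a))
        @inbounds @simd for i in lo:hi
            s += a[i] * b[i]
        end
        partials[t] = s
    end
    return sum(partials)
end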

If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?

When you use vectorized code, everything depends on the implementation of the function you call. In this case dot(a, b) and a' * b are exactly the same: running @edit a' * b shows that in this case the method being called is

*(u::AdjointAbsVec{<:Number}, v::AbstractVector{<:Number}) = dot(u.parent, v)

so you can see it is the same function.
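
A quick way to confirm this yourself in the REPL (a small sketch, reusing a and b from the question):

using InteractiveUtils # @which is available by default in the REPL

@which a' * b        # points at the adjoint-vector * vector method shown above
a' * b == dot(a, b)  # true: a' * b takes exactly the same code path as dot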

If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?

Julia is a compiled language, while Python is an interpreted language. In some cases the Python interpreter can provide fast execution, but in other cases it currently cannot (which does not mean it will not improve in the future). In particular, vectorized functions (like dot in your question) are most likely written in some compiled language, so Julia and Python will not differ much in typical cases, as they both just call this compiled function. However, when you use loops (non-vectorized code), Python will currently be slower than Julia.
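
If you want to see the "compiled" part in practice, Julia lets you inspect the code it generates. As a rough check (reusing inner_product_correct from the other answer), the LLVM IR for the plain loop typically contains packed vector instructions, i.e. the loop is compiled to SIMD machine code much like a hand-written C implementation would be:

using InteractiveUtils

@code_llvm inner_product_correct(a, b)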