39
votes

I adapted a simple program to compute and plot the movement vortices of to Julia to test the language, I also wrote it in Python for no particular reason.

(Disclaimer: 1. Every performance comparison on stackoverflow I read gets slammed for not being comprehensive/correct/well written/relevant etc. etc. - I'm not pretending this is a real comparison, I would just like to know how to make the Julia quicker. 2. I know the python could be optimized, implemented in Cython etc, that's not part of this discussion, it's just here for a reference of equivalent functions in Julia and Python.)

The code and performance results can been seen in a gist.

The performance of Julia is significantly slower than Fortran. The times taken to perform the calculation itself are (50000 time steps):

Fortran: 0.051s
Julia: 2.256s
Python: 30.846s

Julia is much slower (~44 times slow) than Fortran, the gap narrows but is still significant with 10x more time steps( 0.50s vs 15.24s ).

These results are significantly different to those shown on the julia home page. What am I doing wrong? Could I fix the Julia to be significantly quicker?

I've skim read the Julia Performance Tips page and the code behind the comparison on the Julia home page and nothing is standing out to me to fix.

Also interestingly Julia is extremely slow to load PyPlot ( 5secs ish!!) and much slower than Python to read the text file. Could I do anything to improve these things?

Note that the times above don't show loading time for Julia and Python, it's just the raw time taken for the computation AFAIK - see the code. For fortran it's the whole thing. The plotting has been turned off, roughly, in each case to allow speed comparison.

Computer: Intel i7-3770, 16GB ram, SSD HD, OS: Ubuntu 13.10 64bit., Fortran: gfortran, GNU Fortran (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1, Julia: Version 0.3.0-prerelease+396 (2013-12-12 00:18 UTC), Commit c5364db* (0 days old master), x86_64-linux-gnu, Python: 2.7.5+


Update:

Based on ivarne's advice I rewrote the Julia script (updated in gist above): encapsulating grunt work in functions, declaring the type of everything and splitting different elements of matrices into different arrays where applicable. (I included Float64 in quite a few places as I tried Float32 to see if that helped, it didn't most of the time).

The results are as follows:

50,000 time steps:

Fortran: 0.051s (entire programme)
Julia: raw calc.: 0.201s, calc. and return (?): 0.758s, total exec.: 6.947s

500,000 time steps:

Fortran: 0.495s (entire programme)
Julia: raw calc.: 1.547s, calc. and return (?): 2.094s, total exec.: 8.521s

In conclusion:

  • You can speed up Julia quite a bit.

  • You can significantly affect the apparently speed of Julia depending on how you measure it's performance.

2
They did mention using BLAS in the benchmarks. And matmul performance is almost identical in Fortran, C, Julia and MATLAB, that also gives it away. It might not have been a fair comparison between Languages if a good portion of the test is spent in some precompiled library. Maybe you should try replacing part of your code with BLAS calls when possible and do the comparison again?Xiaolei Zhu
There is no need to declare types on functions in Julia, unless you want to use it for multiple dispatch. List comprehensions usually get the correct type without any effort also. If you have a type/immutable structures, you must declare types to get decent performance.ivarne
A +1 for just the disclaimer.ITA

2 Answers

30
votes

I have followed the Julia project for a while now, and I have some comments to the code that might be relevant.

  • It seems like you run a substantial amount of the code in global scope. The global environment is currently very slow in Julia, because the types all variables have to be checked on every iteration. Loops should usually be written in a function.
  • You seem to use array slicing. Currently that makes a copy because Julia does not have fast Array views. You might try to switch them for subarray, but they are currently much slower than they should.

The loading time of PyPlot (and any other package) is a known issue, and is because parsing and compiling Julia code to machine code, is time consuming. There are ideas about having a cache for this process so that this process becomes instantaneous, but it is not finished yet. The Base library is currently cached in compiled state, so most of the infrastructure is on the master branch now.

ADDED: I tried to run the test in an isolated function and got these results. See this gist

Parsing:

elapsed time: 0.334042578 seconds (11797548 bytes allocated)

And tree consecutive runes of the main test loop.

elapsed time: 0.62999287 seconds (195210884 bytes allocated)
elapsed time: 0.39398753 seconds (184735016 bytes allocated)
elapsed time: 0.392036875 seconds (184735016 bytes allocated)

Notice how the timing improved after the first run, because the compiled code was used again.

Update 2 With some improved memory handling (ensure reuse of arrays, because assignment does not copy), I got the timing down to 0.2 seconds (on my machine). There is definitely more that could be done to avoid allocating new arrays, but then it starts to be a little tricky.

This line does not do what you think:

vx_old = vx

but this do what you want:

copy!(vx_old, vx)

and devectorize one loop.

x += 0.5*(vx + vx_old)*delta_t
y += 0.5*(vy + vy_old)*delta_t

to:

for i = 1:nvortex
    x[i] += 0.5*(vx[i] + vx_old[i])*delta_t
    y[i] += 0.5*(vy[i] + vy_old[i])*delta_t
end
10
votes

@ivarne covers this but it bears a little more attention:

julia> @time x=[1:10000];
elapsed time: 1.544e-5 seconds (80120 bytes allocated)

julia> @time y = x[1:10000];
elapsed time: 2.6857e-5 seconds (80120 bytes allocated)

Wow. That's a lot of time and memory.

julia> @time z = sub(x,1:10000);
elapsed time: 6.239e-6 seconds (296 bytes allocated)

Much better. Why doesn't [:] do what sub does? I don't know. Well, I sort of do. When you go to index z[10] Julia thinks, hrmm, z is like x except the indexes are offset by 0 so z[10] is x[10+0]. There you go. That little extra addition will cost you in the long run if you do a lot of indexing. To fix this you need a concept like pointers which is against Julia's religion.

Update Julia now deprecates [:] (version 0.4.0)

julia> @time x=[1:10000];
WARNING: [a] concatenation is deprecated; use collect(a) instead
in depwarn at deprecated.jl:73
in oldstyle_vcat_warning at ./abstractarray.jl:29
in vect at abstractarray.jl:32
while loading no file, in expression starting on line 155
0.530051 seconds (180.12 k allocations: 9.429 MB, 5.26% gc time)

julia> @time x=[1:10000];
WARNING: [a] concatenation is deprecated; use collect(a) instead
in depwarn at deprecated.jl:73
in oldstyle_vcat_warning at ./abstractarray.jl:29
in vect at abstractarray.jl:32
while loading no file, in expression starting on line 155
0.001373 seconds (303 allocations: 714.656 KB)

collect is faster

julia> @ time x=collect(1:10000);
0.003991 seconds (35 allocations: 80.078 KB)

julia> @ time x=collect(1:10000);
0.000031 seconds (8 allocations: 78.406 KB)

comparable to SubArrays

julia> @time z = sub(x,1:10000);
0.067002 seconds (36.27 k allocations: 1.792 MB)

julia> @time z = sub(x,1:10000);
0.000016 seconds (7 allocations: 288 bytes)