
I decided to translate some of the Python code from Peter Harrington's Machine Learning in Action into Julia, starting with the kNN algorithm.

After normalizing a dataset he provided, I wrote a few functions: find_kNN(), mass_kNN() (which finds the kNN for multiple inputs), and a function that splits a given dataset into randomly picked train and test sets, calls mass_kNN(), and plots the resulting accuracy over multiple runs.
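Roughly, the core of find_kNN() looks like this (a simplified sketch; the exact signature and details are in the notebook):

using Distances

# Sketch of find_kNN: classify one input row by the majority label
# among its k nearest neighbours in set_array (one sample per row).
function find_kNN(input_array, set_array, labels, k)
    dist = Float64[]
    # distance from the input to every training row
    [ push!(dist, euclidean(set_array[i,:][:], input_array)) for i in 1:size(set_array, 1) ]
    # count the labels of the k closest rows
    d = Dict{Int,Int}()
    [ d[i] = get(d, i, 0) + 1 for i in labels[sortperm(dist)][1:k] ]
    # return the most common label among those k
    best, best_count = 0, 0
    for (lab, c) in d
        if c > best_count
            best, best_count = lab, c
        end
    end
    best
end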

Then I compared the runtimes of the Julia code and the equivalent Python code. (I am using Distances.jl in Julia to find the Euclidean distances and Gadfly.jl to plot, but turning the plotting off doesn't do much to affect the time.)
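The Julia number below comes from a plain @time around the top-level call, i.e. something like

@time plot_probs(norm_array, 0.25, [1:3], 10, 3)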

Results:

Julia:
elapsed time: 1.175523034 seconds (455531636 bytes allocated, 47.54% gc time)

Python:
time elapsed: 0.9517326354980469 sec

I am wondering whether there is a way to speed up my Julia code, or whether it is already running about as fast as it can (that is, whether I have made any glaring mistakes performance-wise).

  • Julia notebook
  • Python notebook
  • repo with both notebooks and the dataset

Thanks!

Edit: Removing the convert() statements and passing everything around as Real slowed it down to 2.29 seconds.
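That slowdown is expected: containers with an abstract element type like Real defeat Julia's method specialization, so every element access boxes a value and dispatches dynamically. A minimal illustration of the effect (not from the notebook):

function sumsq(xs)
    s = 0.0
    for x in xs
        s += x * x   # specialized machine code when eltype(xs) is concrete
    end
    s
end

concrete = rand(10^6)                        # Vector{Float64}
abstr    = convert(Vector{Real}, concrete)   # Vector{Real}: boxed elements

@time sumsq(concrete)   # fast, essentially no allocation
@time sumsq(abstr)      # several times slower: dynamic dispatch per element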

1

1 Answer


So first of all, I removed the last two lines of plot_probs: plotting isn't really a great thing to benchmark, I think, and it's largely beyond my (or your) control (you could try PyPlot if it's a real factor). I also timed plot_probs a few times to see how much time is spent compiling it on the first call:

**********elapsed time: 1.071184218 seconds (473218720 bytes allocated, 26.36% gc time)
**********elapsed time: 0.658809962 seconds (452017744 bytes allocated, 40.29% gc time)
**********elapsed time: 0.660609145 seconds (452017680 bytes allocated, 40.45% gc time)

So there is a roughly 0.3 s penalty paid once. Moving on to the actual algorithm, I used the built-in profiler (e.g. @profile plot_probs(norm_array, 0.25, [1:3], 10, 3); a sketch of the full workflow follows the list below), which revealed that essentially all of the time is spent here:

  • ~80%: [ push!(dist, euclidean(set_array[i,:][:], input_array)) for i in 1:size(set_array, 1) ]
  • ~20%: [ d[i] = get(d, i, 0) + 1 for i in labels[sortperm(dist)][1:k] ]
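For reference, the profiling workflow was just the following (in Julia 1.x the Profile module must be loaded explicitly; in older versions it was available by default):

using Profile

Profile.clear()                                       # drop any old samples
@profile plot_probs(norm_array, 0.25, [1:3], 10, 3)  # run under the sampler
Profile.print()                                       # sample counts per source line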

Using array comprehensions purely for their side effects like that is not idiomatic Julia (or Python, for that matter). The first line is also slow because all that slicing makes many copies of the data. I'm not an expert with Distances.jl, but I think you can replace the two of them with

# distance from input_array to each column of set_array'
dist = Distances.colwise(Euclidean(), set_array', input_array)

# tally the labels of the k nearest neighbours
d = Dict{Int,Int}()
for i in labels[sortperm(dist)][1:k]
    d[i] = get(d, i, 0) + 1
end

which gave me

**********elapsed time: 0.731732444 seconds (234734112 bytes allocated, 20.90% gc time)
**********elapsed time: 0.30319397 seconds (214057552 bytes allocated, 37.84% gc time)

More performance could be extracted by doing the transpose once in mass_kNN, but that would have required touching a few too many places, and this post is long enough. Trying to micro-optimize it led me to

# preallocate, then fill distances with explicit loops (no slicing)
dist = zeros(size(set_array, 1))
@inbounds for i in 1:size(set_array, 1)
    d = 0.0
    for j in 1:length(input_array)
        z = set_array[i,j] - input_array[j]
        d += z*z
    end
    dist[i] = sqrt(d)
end

which gets it to

**********elapsed time: 0.646256408 seconds (158869776 bytes allocated, 15.21% gc time)
**********elapsed time: 0.245293449 seconds (138817648 bytes allocated, 35.40% gc time)

so that took about half the time off, but it's not really worth it, and it's less flexible (e.g. what if I wanted L1?). Other code review points (unsolicited, I know):

  • I find Vector{Float64} and Matrix{Float64} much easier on the eye than Array{Float64,1} and Array{Float64,2} and less likely to be confused.
  • Float64[] is more usual than Array(Float64, 0)
  • Int64 can just be written as Int, since the code doesn't specifically need 64-bit integers.
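To illustrate those aliases (REPL sketch, on a 64-bit machine):

Vector{Float64} === Array{Float64,1}   # true: Vector is just an alias
Matrix{Float64} === Array{Float64,2}   # true: likewise Matrix
x = Float64[]                          # empty Vector{Float64}
Int === Int64                          # true on 64-bit machines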