11
votes

I had some python code which I tried to port to Julia to learn this lovely language. I used generators in python. After porting it seems to me (at this moment) that Julia is really slow in this area!

I made part of my code simplified to this exercise:

Think 4x4 chess board. Find every N-moves long path, chess king could do. In this exercise, the king is not allowed to leap twice at the same position in one path. Don't waste memory -> make a generator of every path.

Algorithm is pretty simple:

if we sign every position with numbers:

0  1  2  3
4  5  6  7
8  9  10 11
12 13 14 16

point 0 has 3 neighbors (1, 4, 5). We could find a table for every neighbor for every point:

NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]]

PYTHON

A recursive function (generator) which enlarge given path from the list of points or from a generator of (generator of ...) points:

def enlarge(path):
    if isinstance(path, list):
        for i in NEIG[path[-1]]:
            if i not in path:
                yield path[:] + [i]
    else:
        for i in path:
            yield from enlarge(i)

Function (generator) which give every path with given length

def paths(length):
    steps = ([i] for i in range(16))  # first steps on every point on board
    for _ in range(length-1):
        nsteps = enlarge(steps)
        steps = nsteps
    yield from steps

We could see that there are 905776 paths with length 10:

sum(1 for i in paths(10))
Out[89]: 905776

JULIA (this code was created by @gggg during our discussion here )

const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py]
function enlarge(path::Vector{Int})
    (push!(copy(path),loc) for loc in NEIG[path[end]] if !(loc in path))
end
collect(enlarge([1]))
function enlargepaths(paths)
    Iterators.Flatten(enlarge(path) for path in paths)
end
collect(enlargepaths([[1],[2]]))
function paths(targetlen)
    paths = ([i] for i=1:16)
    for newlen in 2:targetlen
        paths = enlargepaths(paths)
    end
    paths
end
p = sum(1 for path in paths(10))

benchmark

In ipython we could time it:

python 3.6.3:

%timeit sum(1 for i in paths(10))
1.25 s ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

julia 0.6.0

julia> @time sum(1 for path in paths(10))
  2.690630 seconds (41.91 M allocations: 1.635 GiB, 11.39% gc time)
905776

Julia 0.7.0-DEV.0

julia> @time sum(1 for path in paths(10))
  4.951745 seconds (35.69 M allocations: 1.504 GiB, 4.31% gc time)
905776

Question(s):

We Julians are saying this: It is important to note that the benchmark codes are not written for absolute maximal performance (the fastest code to compute recursion_fibonacci(20) is the constant literal 6765). Instead, the benchmarks are written to test the performance of identical algorithms and code patterns implemented in each language.

In this benchmark, we are using the same idea. Just simple for cycles over arrays enclosed to generators. (Nothing from numpy, numba, pandas or others c-written and compiled python packages)

Is assumption that Julia's generators are terribly slow right?

What could we do to make it really fast?

6
paths(10) gives Base.Iterators.Flatten{Base.Generator{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Flatten{Base.Generator{Base.Generator{UnitRange{Int64},##15#16},#enlarge}},#enlarge}},#enlarge}},#enlarge}},#enlarge}},#enlarge}},#enlarge}},#enlarge}},#enlarge}}(Base.Generator{Base.Iterators.F... I don't think this looks like its going to be fast.daycaster
Sorry if this is obvious, but you are running the julia one twice and timing the second run not the first right? Otherwise you are just timing the compile time...Colin T Bowers
I know it's cheating but the most effective way to get what you want is likely enumerate_paths (.pathcounts) in the LightGraphs package.Michael K. Borregaard
Looking at the results of the proposed Julia code, I'm not sure if the numbers are correct. For example, eyeballing: [i for i in paths(3) if i[1] == 1] I can't see any paths where the king would return to position 1. So instead of the 15 paths from positon 1 this should be 18, right? I might be misunderstanding the problem proposition here.niczky12
yes, right now generators aren't the best way to program but they work. In any case you need to optimize it, you can use a loop, so this has an easy workaround for the time being. There are a few type-inference and inlining issues that come up with generators that cause this, for now, which hopefully get fixed in a 1.x.Chris Rackauckas

6 Answers

6
votes
const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py];

function expandto(n, path, targetlen)
    length(path) >= targetlen && return n+1
    for loc in NEIG[path[end]]
        loc in path && continue
        n = expandto(n, (path..., loc), targetlen)
    end
    n
end

function npaths(targetlen)
    n = 0
    for i = 1:16
        path = (i,)
        n = expandto(n, path, targetlen)
    end
    n
end

Benchmark (after executing once for JIT-compilation):

julia> @time npaths(10)
  0.069531 seconds (5 allocations: 176 bytes)
905776

which is considerably faster.

6
votes

Julia's "better performance" than Python isn't magical. Most of it stems directly from the fact that Julia can figure out what each variable's type within a function will be, and then compile highly specialized code for those specific types. This even applies to the elements in many containers and iterables like generators; Julia often knows ahead of time what type the elements will be. Python isn't able to do this analysis nearly as easily (or at all, in many cases), so its optimizations have focused on improving the dynamic behaviors.

In order for Julia's generators to know ahead of time what kinds of types they might produce, they encapsulate information about both the operation they perform and the object they iterate over in the type:

julia> (1 for i in 1:16)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##27#28"))}(getfield(Main, Symbol("##27#28"))(), 1:16)

That weird ##27#28 thing is the type of an anonymous function that simply returns 1. By the time the generator gets to LLVM, it knows enough to perform quite a large number of optimizations:

julia> function naive_sum(c)
           s = 0
           for elt in c
               s += elt
           end
           s
       end
       @code_llvm naive_sum(1 for i in 1:16)

; Function naive_sum
; Location: REPL[1]:2
define i64 @julia_naive_sum_62385({ { i64, i64 } } addrspace(11)* nocapture nonnull readonly dereferenceable(16)) {
top:
; Location: REPL[1]:3
  %1 = getelementptr inbounds { { i64, i64 } }, { { i64, i64 } } addrspace(11)* %0, i64 0, i32 0, i32 0
  %2 = load i64, i64 addrspace(11)* %1, align 8
  %3 = getelementptr inbounds { { i64, i64 } }, { { i64, i64 } } addrspace(11)* %0, i64 0, i32 0, i32 1
  %4 = load i64, i64 addrspace(11)* %3, align 8
  %5 = add i64 %4, 1
  %6 = sub i64 %5, %2
; Location: REPL[1]:6
  ret i64 %6
}

It may take a minute to parse through the LLVM IR there, but you should be able to see that it's just extracting the endpoints of the UnitRange (getelementptr and load), subtracting them from each other (sub) and adding one to compute the sum without a single loop.

In this case, though, it works against Julia: paths(10) has a ridiculously complicated type! You're iteratively wrapping that one generator in filters and flattens and yet more generators. It becomes so complicated, in fact, that Julia just gives up trying to figure out with it and decides to live with the dynamic behavior. And at this point, it no longer has an inherent advantage over Python — in fact specializing on so many different types as it recursively walks through the object would be a distinct handicap. You can see this in action by looking at @code_warntype start(1 for i in paths(10)).

My rule of thumb for Julia's performance is that type-stable, devectorized code that avoids allocations is typically within an factor of 2 of C, and dynamic, unstable, or vectorized code is within an order of magnitude of Python/MATLAB/other higher level languages. Often it's a bit slower simply because the other higher level languages have pushed very hard to optimize their case, whereas the majority of Julia's optimizations have been focused on the type-stable side of things. This deeply nested construct puts you squarely in the dynamic camp.

So are Julia's generators terribly slow? Not inherently so; it's just when they become so deeply nested like this that you hit this bad case.

2
votes

Not following the same algorithm (and don't know how fast Python would be doing it like this), but with the following code Julia is basically the same for solutions of length=10, and much better for solutions of length=16

In [48]: %timeit sum(1 for path in paths(10))                                                                                                          
1.52 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)                                                                                    

julia> @time sum(1 for path in pathsr(10))                                                                                                             
  1.566964 seconds (5.54 M allocations: 693.729 MiB, 16.24% gc time)                                                                                   
905776                                                                                                                                                 

In [49]: %timeit sum(1 for path in paths(16))                                                                                                          
19.3 s ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)                                                                                    

julia> @time sum(1 for path in pathsr(16))                                                                                                             
  6.491803 seconds (57.36 M allocations: 9.734 GiB, 33.79% gc time)                                                                                    
343184  

Here is the code. I just learnt about tasks/channels yesterday, so probably it can be done better:

const NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], \
[5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];

function enlarger(num::Int,len::Int,pos::Int,sol::Array{Int64,1},c::Channel)
    if pos == len
        put!(c,copy(sol))
    elseif pos == 0
        for j=0:num
            sol[1]=j
            enlarger(num,len,pos+1,sol,c)
        end
        close(c)
    else
        for i in NEIG[sol[pos]+1]
            if !in(i,sol[1:pos])
                sol[pos+1]=i
                enlarger(num,len,pos+1,sol,c)
            end
        end
    end
end

function pathsr(len)
    c=Channel(0)
    sol = [0 for i=1:len]
    @schedule enlarger(15,len,0,sol,c)
    (i for i in c)
end
2
votes

Following tholy's answer, since tuples seem to be very fast. This is like my previous code, but with the tuple stuff, and it gets substantially better results:

julia> @time sum(1 for i in pathst(10))
  1.155639 seconds (1.83 M allocations: 97.632 MiB, 0.75% gc time)
905776

julia> @time sum(1 for i in pathst(16))
  1.963470 seconds (1.39 M allocations: 147.555 MiB, 0.35% gc time)
343184 

The code:

const NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];

function enlarget(path,len,c::Channel)
    if length(path) >= len
        put!(c,path)
    else
        for loc in NEIG[path[end]+1]
            loc in path && continue
            enlarget((path..., loc), len,c)
        end
        if length(path) == 1
            path[1] == 15 ? close(c) : enlarget((path[1]+1,),len,c)
        end
    end
end

function pathst(len)
    c=Channel(0)
    path=(0,)
    @schedule enlarget(path,len,c)
    (i for i in c)
end
2
votes

Since everybody is writing an answer... here is another version, this time using Iterators, which are kind-of more idiomatic than generators in current Julia (0.6.1). Iterators offer many of the benefits generators have. The iterator definition is in the following lines:

import Base.Iterators: start, next, done, eltype, iteratoreltype, iteratorsize

struct SAWsIterator
    neigh::Vector{Vector{Int}}
    pathlen::Int
    pos::Int
end

SAWs(neigh, pathlen, pos) = SAWsIterator(neigh, pathlen, pos)

start(itr::SAWsIterator) = 
    ([itr.pos ; zeros(Int, itr.pathlen-1)], Vector{Int}(itr.pathlen-1),
     2, Ref{Bool}(false), Ref{Bool}(false))

@inline next(itr::SAWsIterator, s) = 
    ( s[4][] ? s[4][] = false : calc_next!(itr, s) ; 
      (s[1], (s[1], s[2], itr.pathlen, s[4], s[5])) )

@inline done(itr::SAWsIterator, s) = ( s[4][] || calc_next!(itr, s) ; s[5][] )

function calc_next!(itr::SAWsIterator, s)
    s[4][] = true ; s[5][] = false
    curindex = s[3]
    pathlength = itr.pathlen
    path, options = s[1], s[2]
    @inbounds while curindex<=pathlength
        curindex == 1 && ( s[5][] = true ; break )
        startindex = path[curindex] == 0 ? 1 : options[curindex-1]+1
        path[curindex] = 0
        i = findnext(x->!(x in path), neigh[path[curindex-1]], startindex)
        if i==0
            path[curindex] = 0 ; options[curindex-1] = 0 ; curindex -= 1
        else
            path[curindex] = neigh[path[curindex-1]][i]
            options[curindex-1] = i ; curindex += 1
        end
    end
    return nothing
end

eltype(::Type{SAWsIterator}) = Vector{Int}
iteratoreltype(::Type{SAWsIterator}) = Base.HasEltype()
iteratorsize(::Type{SAWsIterator}) = Base.SizeUnknown()

Cut-and-pasting the definition above works. The term SAW was used as an acronym of Self Avoiding Walk, which is sometimes used in mathematics for such a path.

Now, to use/test this iterator, the following code can be executed:

allSAWs(neigh, pathlen) = 
  Base.Flatten(SAWs(neigh,pathlen,k) for k in eachindex(neigh))

iterlength(itr) = mapfoldl(x->1, +, 0, itr)

using Base.Test

const neigh = [[2, 5, 6], [1, 3, 5, 6, 7], [2, 4, 6, 7, 8], [3, 7, 8], 
  [1, 2, 6, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 4, 6, 8, 10, 11, 12], 
  [3, 4, 7, 11, 12], [5, 6, 10, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], 
  [6, 7, 8, 10, 12, 14, 15, 16], [7, 8, 11, 15, 16], [9, 10, 14], 
  [9, 10, 11, 13, 15], [10, 11, 12, 14, 16], [11, 12, 15]]

@test iterlength(allSAWs(neigh, 10)) == 905776

for (i,path) in enumerate(allSAWs(neigh, 10))
    if i % 100_000 == 0
        @show i,path
    end
end

@time iterlength(allSAWs(neigh, 10))

It is relatively readable, and the output looks like this:

(i, path) = (100000, [2, 5, 10, 14, 9, 6, 7, 12, 15, 11])
(i, path) = (200000, [4, 3, 8, 7, 6, 10, 14, 11, 16, 15])
(i, path) = (300000, [5, 10, 11, 16, 15, 14, 9, 6, 7, 3])
(i, path) = (400000, [8, 3, 6, 5, 2, 7, 11, 14, 15, 10])
(i, path) = (500000, [9, 14, 10, 5, 2, 3, 8, 11, 6, 7])
(i, path) = (600000, [11, 16, 15, 14, 10, 6, 3, 8, 7, 12])
(i, path) = (700000, [13, 10, 15, 16, 11, 6, 2, 1, 5, 9])
(i, path) = (800000, [15, 11, 12, 7, 2, 3, 6, 1, 5, 9])
(i, path) = (900000, [16, 15, 14, 9, 5, 10, 7, 8, 12, 11])
  0.130755 seconds (4.16 M allocations: 104.947 MiB, 11.37% gc time)
905776

0.13s is not too bad considering this is not as optimized as @tholy's answer, or some others'. Some tricks used in the other answers are deliberately not used here, specifically:

  • recursion basically uses the stack as a quick way to allocate.
  • Using tuples combined with specialization hides some run-time complexity in the first compile of methods for each tuple signature.

An optimization not seen in the answers yet could be important is using an efficient Bool array or Dict to speed up the check if a vertex was already used in the path. In this answer, the findnext triggers an allocation, which can be avoided and then this answer will be closer to the minimal memory allocation count.

1
votes

This is my quick and dirty cheating experiment (I promised to add it here in comment) where I am trying to speedup Angel's code:

const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py]

function enlargetc(path,len,c::Function)
    if length(path) >= len
        c(path)
    else
        for loc in NEIG[path[end]]
            loc in path && continue
            enlargetc((path..., loc), len,c)
        end
        if length(path) == 1
            if path[1] == 16 return
            else enlargetc((path[1]+1,),len,c)
            end
        end
    end
end

function get_counter()
    let helper = 0
        function f(a)
            helper += 1
            return helper
        end
        return f
    end
end

counter = get_counter()
@time enlargetc((1,), 10, counter)  # 0.481986 seconds (2.62 M allocations: 154.576 MiB, 5.12% gc time)
counter.helper.contents  # 905776

EDIT: time in comment is without recompilation! After recompilation it was 0.201669 seconds (2.53 M allocations: 150.036 MiB, 10.77% gc time).