
I am getting performance degradation after parallelizing the code that calculates graph centrality. The graph is relatively large, about 100K vertices. The single-threaded application takes approximately 7 minutes. As recommended on the julialang site (http://julia.readthedocs.org/en/latest/manual/parallel-computing/#man-parallel-computing), I adapted the code and used the pmap API to parallelize the calculations. I started the calculation with 8 processes (julia -p 8 calc_centrality.jl). To my surprise, I got a 10-fold slowdown: the parallel run now takes more than an hour. I noticed that it takes several minutes for the parallel processes to initialize and start calculating, and even after all 8 CPUs are 100% busy with the julia app, the calculation is extremely slow.

Any suggestions on how to improve parallel performance are appreciated.

calc_centrality.jl:

using Graphs
require("read_graph.jl")
require("centrality_mean.jl")

function main()
    file_name = "test_graph.csv"
    println("graph creation: ", file_name)
    g = create_generic_graph_from_file(file_name)
    println("num edges: ", num_edges(g))
    println("num vertices: ", num_vertices(g))

    data = cell(8)
    data[1] = {g, 1, 2500}
    data[2] = {g, 2501, 5000}
    data[3] = {g, 5001, 7500}
    data[4] = {g, 7501, 10000}
    data[5] = {g, 10001, 12500}
    data[6] = {g, 12501, 15000}
    data[7] = {g, 15001, 17500}
    data[8] = {g, 17501, 20000}
    cm = pmap(centrality_mean, data)
    println(cm)

end
println("Elapsed: ", @elapsed main(), "\n")

centrality_mean.jl

using Graphs

function centrality_mean(gr, start_vertex)
    centrality_cnt = Dict()
    vertex_to_visit = Set()
    push!(vertex_to_visit, start_vertex)

    cnt = 0
    while !isempty(vertex_to_visit)
        next_vertex_set = Set()
        for vertex in vertex_to_visit
            if !haskey(centrality_cnt, vertex)
                centrality_cnt[vertex] = cnt
                for neigh in out_neighbors(vertex, gr)
                    push!(next_vertex_set, neigh)
                end
            end
        end
        cnt += 1
        vertex_to_visit = next_vertex_set
    end
    mean([ v for (k,v) in centrality_cnt ])
end

function centrality_mean(data::Array{})
    gr = data[1]
    v_start = data[2]
    v_end = data[3]
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    v = vertices(gr)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        if cnt%10 == 0
            println(cnt)
        end
        cm[cnt] = centrality_mean(gr, v[i])
    end
    return cm
end
Comments:
I think we would need to see some code to diagnose the problem. Do you have it hosted somewhere accessible (e.g. GitHub or a gist)? – spencerlyon2
Take a look at the code. – dejan
That's a lot of graph moving/copying. How about using a DArray or SharedArray containing your graph, or defining @everywhere g = ...? Then you won't need to specify g in your pmap call. – rickhg12hs
Is your centrality_mean the same as the mean of the non-Inf row values of floyd_warshall(distance_matrix(g, ones(g.nedges)))? If so, how is its performance? – rickhg12hs
I modified the code for centrality_mean from gist.github.com/SirVer/3353761. In essence, centrality mean is a special case (all distances between vertices are equal) of the closeness centrality explained in en.wikipedia.org/wiki/Centrality – dejan

3 Answers

1 vote

I'm guessing this has nothing to do with parallelism. Your second centrality_mean method has no clue what types of objects gr, v_start, and v_end are, because they are pulled out of an untyped Array. So it has to fall back on non-optimized, slow code for that "outer loop."

While there are several potential solutions, probably the easiest is to break up your function that receives the "command" from pmap:

function centrality_mean(data::Array{})
    gr = data[1]
    v_start = data[2]
    v_end = data[3]
    centrality_mean(gr, v_start, v_end)
end

function centrality_mean(gr, v_start, v_end)
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    v = vertices(gr)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        if cnt%10 == 0
            println(cnt)
        end
        cm[cnt] = centrality_mean(gr, v[i])
    end
    return cm
end

All this does is create a break (a "function barrier") and give Julia a chance to optimize the second part, which contains the performance-critical loop, for the actual types of the inputs.
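
The effect of such a function barrier can be seen in isolation with a toy example (mine, not from the question): pulling values out of an untyped container and using them directly forces dynamic dispatch inside the loop, while passing them to a separate kernel function lets Julia compile a specialized method for their concrete types.

# Toy illustration of the function-barrier pattern (my example, not the OP's code).
# Elements of an untyped container have type `Any`, so a loop that uses
# them directly cannot be specialized by the compiler.
function sum_direct(data)
    total = 0.0
    for i = 1:data[2]          # data[2] is `Any` here
        total += data[1][i]    # every operation is dynamically dispatched
    end
    return total
end

# Splitting the loop into a kernel lets Julia compile a fast method for
# the concrete types of `v` and `n` once they arrive as arguments.
function sum_kernel(v, n)
    total = 0.0
    for i = 1:n
        total += v[i]
    end
    return total
end

sum_barrier(data) = sum_kernel(data[1], data[2])

data = Any[rand(1000000), 1000000]
@time sum_direct(data)    # slow: untyped loop
@time sum_barrier(data)   # fast: specialized kernel

Since pmap hands each worker an untyped container, the same principle applies here: keep the heavy loop inside a function that receives the unpacked values as arguments.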

1 vote

Below is the code with the @everywhere changes suggested in the comments by https://stackoverflow.com/users/1409374/rickhg12hs. That fixed the performance issue!

test_parallel_pmap.jl

using Graphs
require("read_graph.jl")
require("centrality_mean.jl")

function main()
    @everywhere file_name = "test_data.csv"
    println("graph creation from: ", file_name)
    @everywhere data_graph = create_generic_graph_from_file(file_name)
    @everywhere data_graph_vertex = vertices(data_graph)
    println("num edges: ", num_edges(data_graph))
    println("num vertices: ", num_vertices(data_graph))

    range = cell(2)
    range[1] = {1, 25000}
    range[2] = {25001, 50000}
    cm = pmap(centrality_mean_pmap, range)
    for i = 1:length(cm)
        println(length(cm[i]))
    end
end

println("Elapsed: ", @elapsed main(), "\n")

centrality_mean.jl

using Graphs

function centrality_mean(start_vertex::ExVertex)
    centrality_cnt = Dict{ExVertex, Int64}()
    vertex_to_visit = Set{ExVertex}()
    push!(vertex_to_visit, start_vertex)

    cnt = 0
    while !isempty(vertex_to_visit)
        next_vertex_set = Set()
        for vertex in vertex_to_visit
            if !haskey(centrality_cnt, vertex)
                centrality_cnt[vertex] = cnt
                for neigh in out_neighbors(vertex, data_graph)
                    push!(next_vertex_set, neigh)
                end
            end
        end
        cnt += 1
        vertex_to_visit = next_vertex_set
    end
    mean([ v for (k,v) in centrality_cnt ])
end

function centrality_mean(v_start::Int64, v_end::Int64)
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        cm[cnt] = centrality_mean(data_graph_vertex[i])
    end
    return cm
end

function centrality_mean_pmap(range::Array{})
    v_start = range[1]
    v_end = range[2]
    centrality_mean(v_start, v_end)
end
0 votes

From the Julia page on parallel computing:

Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processes in separate memory domains at once.

If I interpret this right, Julia's parallelism requires message passing to synchronize the processes. If each individual process does only a little work and then performs a message pass, the computation will be dominated by message-passing overhead rather than useful work.
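
As a rough, self-contained illustration of that overhead (my own sketch, not code from the question), compare handing pmap one tiny task per element against handing it a few large chunks; the chunked version pays the serialization and message-passing cost far fewer times.

# Sketch: fine-grained vs. chunked pmap work units.
# Run with e.g. `julia -p 4 overhead_demo.jl` (hypothetical file name);
# on recent Julia versions `using Distributed` is also required, which
# was not the case in the Julia version used in the question.
@everywhere function work(x)      # a small, purely CPU-bound task
    s = 0.0
    for i = 1:10000
        s += sin(x + i)
    end
    return s
end

xs = collect(1.0:10000.0)

# One message round-trip per element: communication overhead dominates.
fine = @elapsed pmap(work, xs)

# A handful of big chunks: each worker receives one batch and replies once.
chunks = [xs[i:min(i + 2499, end)] for i = 1:2500:length(xs)]
coarse = @elapsed pmap(chunk -> map(work, chunk), chunks)

println("per-element pmap: ", fine, " s;  chunked pmap: ", coarse, " s")

In the question's code each pmap task already covers 2500 vertices, so task granularity alone is probably not the culprit; the bigger cost is more likely shipping the whole graph to the workers with every task, which the @everywhere approach in the answer above avoids.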

I can't tell from your code (and I don't know Julia well enough) where the parallelism boundaries are. But you have a big, complicated graph that may be spread willy-nilly across multiple processes. If they need to interact while walking across graph links, you'll have exactly that kind of overhead.

You may be able to fix it by precomputing a partition of the graph into roughly equal-size, highly cohesive regions. I suspect that breakup requires the same type of complex graph processing that you already want to do, so you may have a chicken-and-egg problem to boot.

It may be that Julia offers you the wrong parallelism model. You might want a shared address space so that threads walking the graph don't have to use messages to traverse arcs.
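
For what it's worth, later Julia versions added exactly that model: Threads.@threads runs loop iterations on threads sharing one address space, so the graph is never copied or messaged. Below is a hypothetical sketch of what the outer loop could look like in that model; it assumes the question's centrality_mean(g, v) helper and graph API, and it is not available in the Julia version the question was written against.

# Hypothetical shared-memory version of the outer loop using threads.
# Requires a later Julia release; start it with e.g. JULIA_NUM_THREADS=8.
# Assumes `centrality_mean(g, v)` and `vertices(g)` from the question's code.
function centrality_mean_threaded(g, v_start, v_end)
    v = vertices(g)
    cm = zeros(v_end - v_start + 1)
    Threads.@threads for i = v_start:v_end
        # all threads read the same in-memory graph; each call writes only
        # to its own slot of `cm`, so no locking is needed
        cm[i - v_start + 1] = centrality_mean(g, v[i])
    end
    return cm
end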