1
votes

I am very new to the Julia programming language and I am testing out some Euclidean distance operations I typically perform in other languages. The functions work if calling it serially, but the pmap calls are not returning the desired results. Could someone take a look and let me know if I am going about this the right way? is pmap even the best way to approach this?

using Distributed

#Example data
d1 = randn(50000,3) 
d2 = randn(50000,3) 

First Function: Euclidean Distance Matrix

function EDM(m1, m2)
    n1 = size(m1, 1)
    n2 = size(m2,1)
    k = size(m1, 2)
    Dist = zeros(n1,n2)
    for i in 1:n1
        for j in 1:n2
            dtemp = 0
            for a in 1:k
                dtemp += (m1[i,a] - m2[j,a]) ^ 2
            end
            Dist[i,j] = sqrt(dtemp)
        end
    end
    return Dist
end

#pmap call
function pmap_EDM(m1,m2)
    return pmap(EDM, m1, m2)
end

Second Function: Minimum Euclidean Distances Unidirectional

function MED(m1, m2)
    n1 = size(m1, 1)
    n2 = size(m2,1)
    k = size(m1, 2)
    Dist = zeros(n1,1)
    for i in 1:n1
        dsum = Inf
        for j in 1:n2
            dtemp = 0
            for a in 1:k
                dtemp += (m1[i,a] - m2[j,a]) ^ 2
            end
            dtemp = sqrt(dtemp)
            if dtemp < dsum
                dsum = copy(dtemp)
            end
        end
        Dist[i,1] = dsum
    end
    return Dist
end

#pmap call
function pmap_MED(m1,m2)
    return pmap(MED, m1, m2)
end

Third Function: Minimum Euclidean Distances and Corresponding Indices Unidirectional

function MEDI(m1, m2)
    n1 = size(m1, 1)
    n2 = size(m2,1)
    k = size(m1, 2)
    Dist = zeros(n1,2)
    for i in 1:n1
        dsum = Inf
        dsum_ind = 0
        for j in 1:n2
            dtemp = 0
            for a in 1:k
                dtemp += (m1[i,a] - m2[j,a]) ^ 2
            end
            dtemp = sqrt(dtemp)
            if dtemp < dsum
                dsum = copy(dtemp)
                dsum_ind = copy(j)
            end
        end
        Dist[i,1] = dsum
        Dist[i,2] = dsum_ind
    end
    return Dist
end

#pmap call
function pmap_MEDI(m1,m2)
    return pmap(MEDI, m1, m2)
end

Calling functions

r1 = EDM(d1,d2) #serial
r2 = pmap_EDM(d1,d2)

r3 = MED(d1,d2) #serial
r4 = pmap_MED(d1,d2)

r5 = MEDI(d1,d2) #serial
r6 = pmap_MEDI(d1,d2)

Edited:

The first function should return a simple Euclidean distance matrix with the distances between each row in one array to every row in the second array. The second and third functions are deviations of this to return a subset of those distances based on the minimum distance for each row in one array to every other row in another array (with the third function returning the index position of the minimum distance). The distances do not appear to be calculated correctly and the latter two functions using pmap are returning an nx3 matrix rather than nx1 and nx2 respectively.

Edited 2: example using smaller data set to show results

d1 = randn(5,3) 
d2 = randn(5,3) 

julia> EDM(d1,d2)
5×5 Array{Float64,2}:
 2.60637  3.18867  1.0745    2.60328  1.58608 
 1.2763   2.31037  3.04379   2.74113  2.00452 
 1.70024  2.07731  3.12397   2.60893  2.05932 
 2.44581  1.57345  0.910323  1.08718  0.407675
 3.42936  1.13001  2.18345   1.08764  1.70883 

julia> pmap_EDM(d1,d2)
5×3 Array{Array{Float64,2},2}:
 [0.397928]  [2.39283]   [0.953501]
 [1.06776]   [0.815057]  [1.87973] 
 [0.151963]  [3.05161]   [0.650967]
 [0.571021]  [0.275554]  [0.883151]
 [0.109293]  [0.635398]  [1.58254] 

julia> MED(d1,d2)
5×1 Array{Float64,2}:
 1.0744953977891307 
 1.2762979313081781 
 1.7002448697495505 
 0.40767454400155695
 1.0876399289364607 

julia> pmap_MED(d1,d2)
5×3 Array{Array{Float64,2},2}:
 [0.397928]  [2.39283]   [0.953501]
 [1.06776]   [0.815057]  [1.87973] 
 [0.151963]  [3.05161]   [0.650967]
 [0.571021]  [0.275554]  [0.883151]
 [0.109293]  [0.635398]  [1.58254] 

julia> MEDI(d1,d2)
5×2 Array{Float64,2}:
 1.0745    3.0
 1.2763    1.0
 1.70024   1.0
 0.407675  5.0
 1.08764   4.0

julia> pmap_MEDI(d1,d2)
5×3 Array{Array{Float64,2},2}:
 [0.397928 1.0]  [2.39283 1.0]   [0.953501 1.0]
 [1.06776 1.0]   [0.815057 1.0]  [1.87973 1.0] 
 [0.151963 1.0]  [3.05161 1.0]   [0.650967 1.0]
 [0.571021 1.0]  [0.275554 1.0]  [0.883151 1.0]
 [0.109293 1.0]  [0.635398 1.0]  [1.58254 1.0] 

Edited 3: @distributed version of function two

using Distributed
using SharedArrays

#Minimum Euclidean Distances Unidirectional
@everywhere function MD(v1, m2)
    n = size(m2, 1)
    dsum = Inf
    for j in 1:n
        dtemp = sqrt((v1[1] - m2[j,1]) ^ 2 + (v1[2] - m2[j,2]) ^ 2 + (v1[3] - m2[j,3]) ^ 2)
        if dtemp < dsum
            dsum = dtemp
        end
    end
    return dsum
end

function MED(m1, m2)
    n1 = size(m1,1)
    Dist = SharedArray{Float64}(n1)
    m3 = SharedArray{Float64}(m2)
    @sync @distributed for k in 1:n1
        Dist[k] = MD(m1[k,:], m3)
    end
    return Dist
end
1
What are you expecting to see and what are you seeing that does not meet those expectations?Engineero
@Engineero I've updated the question to specify this. F1: expecting to see a simple Euclidean distance matrix. F2: expecting to see the minimum distances rather than the entire distance matrix. F3: expecting to see the minimum distances and the index position the minimum distance occurs in the second data set.JJL
I'm no expert on pmap, but two other comments about your routines: 1) Julia uses column-major ordering (unlike C), so iterating across columns like you are on the innermost loop is significantly slower than if you rephrased your problem to be iterating across rows. 2) The Distances package is a really well-written Julia package, so you're looking for optimal speed, you should check it out.Colin T Bowers
A third comment: subtypes of Number, eg floats and ints, are immutable in Julia, so an operation like copy(1) is allowed, but not particularly meaningful. You can do x = 0 ; y = x ; y = 1, and you'll note that x is still 0. copy is typically used for mutable types, like arrays.Colin T Bowers
@ColinTBowers Thanks for the information. I'll check out the Distances package, but at first glance it doesn't appear to provide a way to calculate an entire distance matrix?JJL

1 Answers

1
votes

I did not went into the details of your code, but could it be that you apply pmap at the wrong code level? For instance if you have the following serial code

for i = 1:imax
    # do some work
end

You would write this as:

function function_for_single_iteration(i)
     # do some work
end

pmap(function_for_single_iteration,1:imax)

Essentially the pmap replaces a (outer) for loop. Before using pmap, I usually first use the serial map function to check that I have the same results.

Note that pmap and map would return a vector. In your case, probably a vector of vectors of distances. You would need to use cat to turn this into a matrix.