4
votes

I'm trying to learn to write good julia code. I would like to code up the following statistic.

(note 1{A} = 1 if A true, 0 if A false)

kappa(k) = sum_j (p_jj(k) - p_j^2) / sum_j (1/m - p_j^2)

where

p_jj(k) = (1/(n-k)) * sum_{i=1}^{n-k} 1{x_i = j} * 1{x_{i+k} = j}

and

p_j = (1/n) * sum_{i=1}^{n} 1{x_i = j}

Here n is the length of x, m is the number of distinct values in x, and the sums over j run over those distinct values.

function cohens_kappa(x::Vector{Int}, k::Int)
  support = unique(x)
  m = length(support)
  n = length(x)
  y = falses(n, m)      # indicator matrix: y[i, j] = (x[i] == support[j])
  for j in eachindex(support)
    y[:,j] = (x .== support[j])
  end

  num = 0.0
  den = 0.0
  for j in eachindex(support)
    pjjk = sum(y[(1 + k):n, j] .& y[1:(n - k), j]) / (n - k)
    pj = sum(y[:, j]) / n

    num += pjjk - pj ^ 2
    den += (1 / m) - pj ^ 2
  end
  return (num / den)
end

Is this the most efficient way to code this up?

EDIT: Thanks for all the suggestions guys. Can you explain why your code is more efficient? I'd like to learn how to continue to write good code in the future.

Testing against @user3580870's two examples, we have:

@time [cohens_kappa(X, k) for k in 1:15]
  0.000507 seconds (1.58 k allocations: 269.016 KB)

@time [cohens_kappa2(X, k) for k in 1:15]
  0.000336 seconds (166 allocations: 12.375 KB)

@time [cohens_kappa3(X, k) for k in 1:15]
  0.000734 seconds (303 allocations: 84.109 KB)

Looks like your second suggestion is not as fast, but it makes fewer allocations than my original version, so it might be faster for very large vectors.

3
Suppose the support of the vector is K values and its length is N. The algorithm in the Question generates K vectors, which takes K*N comparisons. The algorithms in the Answers, by contrast, scan the vector only a fixed number of times (two). When K is large, the Answer algorithms should have a greater advantage. (What is the K in your data?) - Dan Getz
In my data K is usually fairly small (less than 10) - bdeonovic
The must-read for optimizing Julia code is docs.julialang.org/en/release-0.4/manual/performance-tips . It's even recommended to re-read it once in a while (Julia is growing and changing). - Dan Getz
Since K < 10 in your data, the difference in efficiency is not striking. Try it with data having K=100 (e.g. X = rand(1:100,N)) and make N large, so run-time is around 0.01 secs, to make the timings more reliable. - Dan Getz
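Dan Getz's K*N-versus-two-scans point can be seen directly on made-up data. A minimal sketch (the vector below just follows his suggested X = rand(1:100, N) test case; none of it comes from the question's data):

```julia
# Toy comparison of the two counting strategies discussed in the comments.
x = rand(1:100, 10_000)          # support of K = 100 values, as suggested
support = unique(x)

# Question's strategy: one full vector of comparisons per support value (K*N work)
onehot_counts = [sum(x .== v) for v in support]

# Answers' strategy: a single scan with a dictionary (O(N) work)
d = Dict{Int,Int}()
for v in x
    d[v] = get(d, v, 0) + 1
end

# Both strategies produce the same frequencies
@assert all(onehot_counts[j] == d[support[j]] for j in eachindex(support))
```

The one-hot strategy also allocates a fresh comparison vector per support value, which is roughly where the extra allocations in the question's timings come from.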

3 Answers

3
votes

Here is another version, which gives a significant improvement:

   function cohens_kappa2(x::Vector{Int}, k::Int)
     d = Dict{Int,Int}()          # maps each value to its slot in c1/c2
     n = length(x)
     c1 = Int[]                   # c1[p] = frequency of the p-th distinct value
     pnew = 0
     for i=1:n                    # first pass: count frequencies
       p = get(d,x[i],0)
       if p>0
         c1[p] += 1
       else
         pnew += 1
         d[x[i]] = pnew
         push!(c1,1)
       end
     end
     c2 = zeros(Int,pnew)         # c2[p] = number of k-lag matches of that value
     for i=(k+1):n                # second pass: count k-lag matches
       if x[i-k]==x[i] c2[d[x[i]]] += 1 ; end
     end
     num, dentmp = 0.0, 0.0
     for i=1:pnew                 # accumulate numerator and denominator terms
       pjjk = c2[i]/(n-k)
       pj = c1[i] / n
       num += pjjk - pj^2
       dentmp += pj^2
     end
     return (num / (1.0-dentmp))
   end

In general, further optimization is almost always possible, but, like extracting oil from the ground, it comes at growing cost and effort for the programmer.

On a test case the above gave me a 5x to 10x speedup. What is the result for your data?

2
votes

There is one more optimization available. To find the autocorrelation you need only work with the reduced vector of matches found by logical indexing:

function cohens_kappa_2(x::Vector{Int}, k::Int)

  ...

  # Autocorrelation dictionary
  dxx = Dict{Int,Int}()

  # k-step element-wise matches
  xx = (x[1:end-k])[x[1:end-k] .== x[1+k:end]]

  # Populate the dictionary
  for exx in xx
    dxx[exx] = get(dxx, exx, 0) + 1
  end

  ...

end

The key insight is that the autocorrelation part of the statistic only "cares about" elements that match the element k steps ahead, so we can filter for those first.

Note that a dictionary is used because its hashing resolves distinct elements in amortized linear time. By comparison, approaches based on sorting or tree structures would cost O(N * log N) on average.
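A minimal runnable sketch of the dictionary-counting step (the vector below is made-up toy data):

```julia
x = [2, 3, 2, 2, 5, 3]

# one scan; get(counts, v, 0) supplies the zero for values not yet seen
counts = Dict{Int,Int}()
for v in x
    counts[v] = get(counts, v, 0) + 1
end
# counts == Dict(2 => 3, 3 => 2, 5 => 1)
```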

1
votes

Adding yet another version. This one is slightly slower than the longer, more unfolded version given in my other Answer. On the other hand, it is much cleaner and uses the DataStructures package's counter. In the code, cc counts all elements and their frequencies and cc2 counts k-distance pairs of identical elements. The source:

using DataStructures     # install with Pkg.add("DataStructures")

function cohens_kappa3(x::Vector{Int}, k::Int)
    n = length(x)
    cc = counter(x)                 # frequency of each value
    cc2 = counter(x[[i<=k ? false : x[i]==x[i-k] for i=1:n]])  # k-distance matches
    num, den = 0.0, 1.0
    for (val,freq) in cc
        pj2 = (freq/n)^2
        num += cc2[val]/(n-k) - pj2
        den -= pj2
    end
    return num/den
end
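As a sanity check that this counting formulation matches the statistic's definition, here is a self-contained sketch using plain Dicts in place of DataStructures' counter (the vector x and lag k below are made-up toy values):

```julia
x = [1, 2, 1, 1, 2, 1, 2, 2]
k = 1
n = length(x)

# counting formulation (mirrors cohens_kappa3, without DataStructures)
cc = Dict{Int,Int}()                      # value => frequency
for v in x
    cc[v] = get(cc, v, 0) + 1
end
cc2 = Dict{Int,Int}()                     # value => number of k-lag matches
for i = (k+1):n
    x[i] == x[i-k] && (cc2[x[i]] = get(cc2, x[i], 0) + 1)
end
num = sum(get(cc2, val, 0)/(n-k) - (freq/n)^2 for (val, freq) in cc)
den = 1.0 - sum((freq/n)^2 for (val, freq) in cc)
kappa_counts = num/den

# direct evaluation of the definition in the question
support = unique(x)
m = length(support)
num2 = sum(sum(x[i]==j && x[i+k]==j for i=1:n-k)/(n-k) - (sum(x .== j)/n)^2
           for j in support)
den2 = sum(1/m - (sum(x .== j)/n)^2 for j in support)

@assert isapprox(kappa_counts, num2/den2)
```

The two denominators agree because sum_j (1/m - p_j^2) = 1 - sum_j p_j^2, which is why the loop can start from den = 1.0 and subtract.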