4
votes

I'm trying to learn to write good julia code. I would like to code up the following statistic.

(note 1{A} = 1 if A true, 0 if A false)

kappa(k) = sum_j (p_jj(k) - p_j^2) / sum_j (1/m - p_j^2)

where

p_jj(k) = (1/(n-k)) * sum_{i=1}^{n-k} 1{x_i = j} * 1{x_{i+k} = j}

and

p_j = (1/n) * sum_{i=1}^{n} 1{x_i = j}

Here n is the length of x, m is the number of distinct values in x, and the sums over j run over those distinct values.

function cohens_kappa(x::Vector{Int}, k::Int)
  support = unique(x)
  m = length(support)
  n = length(x)
  y = falses(n, m)      # indicator matrix: y[i, j] = (x[i] == support[j])
  for j in eachindex(support)
    y[:,j] = (x .== support[j])
  end

  num = 0.0
  den = 0.0
  for j in eachindex(support)
    pjjk = sum(y[(1 + k):n, j] .& y[1:(n - k), j]) / (n - k)
    pj = sum(y[:, j]) / n

    num += pjjk - pj ^ 2
    den += (1 / m) - pj ^ 2
  end
  return (num / den)
end

Is this the most efficient way to code this up?

EDIT: Thanks for all the suggestions guys. Can you explain why your code is more efficient? I'd like to learn how to continue to write good code in the future.

Testing against @user3580870's two examples, we have:

@time [cohens_kappa(X, k) for k in 1:15]
  0.000507 seconds (1.58 k allocations: 269.016 KB)

@time [cohens_kappa2(X, k) for k in 1:15]
  0.000336 seconds (166 allocations: 12.375 KB)

@time [cohens_kappa3(X, k) for k in 1:15]
  0.000734 seconds (303 allocations: 84.109 KB)

Looks like your second suggestion is not as fast, but it makes fewer allocations than my original version, so it might be faster for very large vectors.

3
Suppose the support of the vector is K values and its length is N. The algorithm in the Question generates K vectors, which takes K*N comparisons. The algorithms in the Answers, by contrast, scan the vector only a fixed number of times (two). When K is large, the Answer algorithms should have a greater advantage. (What is the K in your data?) - Dan Getz
In my data K is usually fairly small (less than 10) - bdeonovic
The must-read for optimizing Julia code is docs.julialang.org/en/release-0.4/manual/performance-tips . It's even recommended to re-read it once in a while (Julia is growing and changing). - Dan Getz
Since K < 10 in your data, the difference in efficiency is not striking. Try it with data having K=100 (e.g. X = rand(1:100,N)) and make N large, so run-time is around 0.01 secs, to make the timings more reliable. - Dan Getz
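Dan Getz's K*N-versus-two-scans point can be seen directly on made-up data. A minimal sketch (the vector below just follows his suggested X = rand(1:100, N) test case; none of it comes from the question's data):

```julia
# Toy comparison of the two counting strategies discussed in the comments.
x = rand(1:100, 10_000)          # support of K = 100 values, as suggested
support = unique(x)

# Question's strategy: one full vector of comparisons per support value (K*N work)
onehot_counts = [sum(x .== v) for v in support]

# Answers' strategy: a single scan with a dictionary (O(N) work)
d = Dict{Int,Int}()
for v in x
    d[v] = get(d, v, 0) + 1
end

# Both strategies produce the same frequencies
@assert all(onehot_counts[j] == d[support[j]] for j in eachindex(support))
```

The one-hot strategy also allocates a fresh comparison vector per support value, which is roughly where the extra allocations in the question's timings come from.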

3 Answers

3
votes

Here is another version, which gives a significant improvement:

   function cohens_kappa2(x::Vector{Int}, k::Int)
     d = Dict{Int,Int}()          # maps each value to its slot in c1/c2
     n = length(x)
     c1 = Int[]                   # c1[p] = frequency of the p-th distinct value
     pnew = 0
     for i=1:n                    # first pass: count frequencies
       p = get(d,x[i],0)
       if p>0
         c1[p] += 1
       else
         pnew += 1
         d[x[i]] = pnew
         push!(c1,1)
       end
     end
     c2 = zeros(Int,pnew)         # c2[p] = number of k-lag matches of that value
     for i=(k+1):n                # second pass: count k-lag matches
       if x[i-k]==x[i] c2[d[x[i]]] += 1 ; end
     end
     num, dentmp = 0.0, 0.0
     for i=1:pnew                 # accumulate numerator and denominator terms
       pjjk = c2[i]/(n-k)
       pj = c1[i] / n
       num += pjjk - pj^2
       dentmp += pj^2
     end
     return (num / (1.0-dentmp))
   end

In general, further optimization is almost always possible, but, like extracting oil from the ground, it comes at growing cost and effort for the programmer.

On a test case the above gave me a 5x to 10x speedup. What is the result for your data?

2
votes

There is one more optimization available. To find the autocorrelation you need only work with the reduced vector of matches found by logical indexing:

function cohens_kappa_2(x::Vector{Int}, k::Int)

  ...

  # Autocorrelation dictionary
  dxx = Dict{Int,Int}()

  # k-step element-wise matches
  xx = (x[1:end-k])[x[1:end-k] .== x[1+k:end]]

  # Populate the dictionary
  for exx in xx
    dxx[exx] = get(dxx, exx, 0) + 1
  end

  ...

end

The key insight is that the autocorrelation part of the statistic only "cares about" elements that match the element k steps ahead, so we can filter for those first.

Note that a dictionary is used because its hashing resolves distinct elements in amortized linear time. By comparison, approaches based on sorting or tree structures would cost O(N * log N) on average.
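A minimal runnable sketch of the dictionary-counting step (the vector below is made-up toy data):

```julia
x = [2, 3, 2, 2, 5, 3]

# one scan; get(counts, v, 0) supplies the zero for values not yet seen
counts = Dict{Int,Int}()
for v in x
    counts[v] = get(counts, v, 0) + 1
end
# counts == Dict(2 => 3, 3 => 2, 5 => 1)
```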

1
votes

Adding yet another version. This one is slightly slower than the longer, more unfolded version given in my other Answer. On the other hand, it is much cleaner and uses the DataStructures package's counter. In the code, cc counts all elements and their frequencies and cc2 counts k-distance pairs of identical elements. The source:

using DataStructures     # install with Pkg.add("DataStructures")

function cohens_kappa3(x::Vector{Int}, k::Int)
    n = length(x)
    cc = counter(x)                 # frequency of each value
    cc2 = counter(x[[i<=k ? false : x[i]==x[i-k] for i=1:n]])  # k-distance matches
    num, den = 0.0, 1.0
    for (val,freq) in cc
        pj2 = (freq/n)^2
        num += cc2[val]/(n-k) - pj2
        den -= pj2
    end
    return num/den
end
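As a sanity check that this counting formulation matches the statistic's definition, here is a self-contained sketch using plain Dicts in place of DataStructures' counter (the vector x and lag k below are made-up toy values):

```julia
x = [1, 2, 1, 1, 2, 1, 2, 2]
k = 1
n = length(x)

# counting formulation (mirrors cohens_kappa3, without DataStructures)
cc = Dict{Int,Int}()                      # value => frequency
for v in x
    cc[v] = get(cc, v, 0) + 1
end
cc2 = Dict{Int,Int}()                     # value => number of k-lag matches
for i = (k+1):n
    x[i] == x[i-k] && (cc2[x[i]] = get(cc2, x[i], 0) + 1)
end
num = sum(get(cc2, val, 0)/(n-k) - (freq/n)^2 for (val, freq) in cc)
den = 1.0 - sum((freq/n)^2 for (val, freq) in cc)
kappa_counts = num/den

# direct evaluation of the definition in the question
support = unique(x)
m = length(support)
num2 = sum(sum(x[i]==j && x[i+k]==j for i=1:n-k)/(n-k) - (sum(x .== j)/n)^2
           for j in support)
den2 = sum(1/m - (sum(x .== j)/n)^2 for j in support)

@assert isapprox(kappa_counts, num2/den2)
```

The two denominators agree because sum_j (1/m - p_j^2) = 1 - sum_j p_j^2, which is why the loop can start from den = 1.0 and subtract.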