1
votes

Sometimes I want to use a double for loop over the columns of a matrix, compute some value for each pair of columns, and assign the result to a cell in another matrix. A correlation table is an example of this. I was wondering if/how this can be done in data.table syntax. Here's the example as a for loop. How can I do the same thing in *data.table*? Even if it is slower, this is more a question of whether it can be done, though faster would be nice. Note that we can't assume the value computed will give a symmetric matrix (i.e., y[i, j] != y[j, i] is possible).

cos_sim <- function(x, y) x %*% y / sqrt(x%*%x * y%*%y)

x <- mtcars
y <- matrix(, nrow = ncol(x), ncol = ncol(x))

for (i in 1:ncol(x)) {
    for (j in 1:ncol(x)) {
        y[i, j] <- cos_sim(x[, i], x[, j])
    }
}

library(data.table)
x <- as.data.frame(x)
setDT(x)

3 Answers

2
votes

Another base R approach would be outer.

outer(x, x, FUN=Vectorize(cos_sim))
#          mpg       cyl      disp        hp      drat        wt      qsec
#mpg  1.0000000 0.8566168 0.7356738 0.7794276 0.9768897 0.8483280 0.9660715
#cyl  0.8566168 1.0000000 0.9656088 0.9689702 0.9241079 0.9828563 0.9414552
#disp 0.7356738 0.9656088 1.0000000 0.9576400 0.8266655 0.9659344 0.8599014
#hp   0.7794276 0.9689702 0.9576400 1.0000000 0.8717482 0.9492708 0.8750691
#drat 0.9768897 0.9241079 0.8266655 0.8717482 1.0000000 0.9183274 0.9859895
#wt   0.8483280 0.9828563 0.9659344 0.9492708 0.9183274 1.0000000 0.9484697
#qsec 0.9660715 0.9414552 0.8599014 0.8750691 0.9859895 0.9484697 1.0000000
#vs   0.7753943 0.4700802 0.3356976 0.3742408 0.7022767 0.5143092 0.7130090
#am   0.7421732 0.5030698 0.3505303 0.5007184 0.7101727 0.4575882 0.6169362
#gear 0.9672733 0.9177938 0.8172070 0.8812034 0.9903890 0.9076279 0.9723964
#carb 0.7581483 0.9082799 0.8604485 0.9450793 0.8549106 0.8943285 0.8346877
#            vs        am      gear      carb
#mpg  0.7753943 0.7421732 0.9672733 0.7581483
#cyl  0.4700802 0.5030698 0.9177938 0.9082799
#disp 0.3356976 0.3505303 0.8172070 0.8604485
#hp   0.3742408 0.5007184 0.8812034 0.9450793
#drat 0.7022767 0.7101727 0.9903890 0.8549106
#wt   0.5143092 0.4575882 0.9076279 0.8943285
#qsec 0.7130090 0.6169362 0.9723964 0.8346877
#vs   1.0000000 0.5188745 0.6788292 0.3655971
#am   0.5188745 1.0000000 0.7435907 0.5766850
#gear 0.6788292 0.7435907 1.0000000 0.8802046
#carb 0.3655971 0.5766850 0.8802046 1.0000000

It can also be written in data.table syntax, but the output is still a matrix, so I wouldn't expect any improvement in efficiency.

setDT(x)[,outer(.SD, .SD, FUN=Vectorize(cos_sim))]
1
votes

Here's one way:

x <- mtcars
setDT(x)

x[, lapply(.SD, function(xx) {
  lapply(x, function(yy) cos_sim(xx, yy))
  })]

The biggest difference between this and your original is really the use of lapply in place of the for loops. It's data.table-ish in that it makes use of .SD, but one can also just do the following in base R:

sapply(x, function(xx) {
  sapply(x, function(yy) cos_sim(xx, yy))
})

I think it's more svelte than, and preferable to, nested for loops, but I'm not sure it's really taking special advantage of data.table per se.
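One caveat, given the question's note about asymmetric results: with nested sapply, the outer variable ends up on the columns, so the result is the transpose of the loop's y[i, j] layout. That makes no difference for cos_sim, which is symmetric, but it would for an asymmetric function. Here is a minimal sketch of the effect, using a made-up function asym (hypothetical, purely for illustration):

```r
# Hypothetical asymmetric function, used only for illustration:
# asym(a, b) != asym(b, a) in general.
asym <- function(a, b) sum(a) / sum(b)

x <- mtcars
n <- ncol(x)

# Loop convention from the question: y[i, j] = f(column i, column j).
y <- matrix(NA_real_, n, n, dimnames = list(names(x), names(x)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    y[i, j] <- asym(x[[i]], x[[j]])
  }
}

# Nested sapply: each inner call produces one column of the result,
# so s[j, i] = f(column i, column j).
s <- sapply(x, function(xx) sapply(x, function(yy) asym(xx, yy)))

same_as_loop      <- isTRUE(all.equal(s, y, check.attributes = FALSE))
same_after_transp <- isTRUE(all.equal(t(s), y, check.attributes = FALSE))
```

Here same_as_loop should come out FALSE and same_after_transp TRUE, so wrapping the sapply result in t() (or swapping the roles of xx and yy) restores the loop's orientation.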

1
votes

Matrix algebra

As far as efficiency goes, yeah, matrix operations are your best bet:

mx  <- as.matrix(x)
sx  <- 1 / sqrt( colSums(mx^2) )
res <- (t(mx) %*% mx) * (sx %*% t(sx))

This also gives you nice row and col labels, unlike the OP's for loop.
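This works because t(mx) %*% mx holds every pairwise dot product of the columns, and multiplying elementwise by the outer product of the reciprocal column norms divides each entry by the two norms. As a rough sanity check, here is a sketch comparing it with the question's loop. It uses crossprod and tcrossprod as base R equivalents of the products above, and wraps cos_sim in drop() (my addition) so it returns a plain number rather than a 1x1 matrix:

```r
# Cosine similarity as in the question; drop() is an addition here so the
# result is a plain number instead of a 1x1 matrix.
cos_sim <- function(x, y) drop(x %*% y / sqrt(x %*% x * y %*% y))

mx <- as.matrix(mtcars)
n  <- ncol(mx)

# Reference result: the nested for loop from the question.
y_loop <- matrix(NA_real_, nrow = n, ncol = n,
                 dimnames = list(colnames(mx), colnames(mx)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    y_loop[i, j] <- cos_sim(mx[, i], mx[, j])
  }
}

# Matrix-algebra result: all pairwise dot products, scaled elementwise
# by the reciprocal column norms.
sx  <- 1 / sqrt(colSums(mx^2))
res <- crossprod(mx) * tcrossprod(sx)

# TRUE when the two results agree up to floating-point tolerance.
agree <- isTRUE(all.equal(res, y_loop, check.attributes = FALSE))
```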


data.table

This isn't really natural here, but...

meltx <- melt(x[,id:=.I], id.vars="id"); x[,id:=NULL]
cartx <- meltx[meltx, on="id", allow.cartesian=TRUE]
res2  <- dcast(cartx[, cos_sim(value, i.value), by=.(v1=variable,v2=i.variable)], v1~v2)

You get a data.table back out in this case, if that's a plus.
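If a data.table is all you want at the end, another option is to compute the matrix with the matrix-algebra approach and convert it afterwards. This is only a sketch, and the column names v1, v2 and cos_sim are arbitrary choices of mine:

```r
library(data.table)

# Matrix-algebra result, as in the first part of this answer.
mx  <- as.matrix(mtcars)
sx  <- 1 / sqrt(colSums(mx^2))
res <- (t(mx) %*% mx) * (sx %*% t(sx))

# Wide form: one row per variable; the row names go into column "v1".
res_dt <- as.data.table(res, keep.rownames = "v1")

# Long form: one row per (v1, v2) pair of variables.
res_long <- melt(res_dt, id.vars = "v1",
                 variable.name = "v2", value.name = "cos_sim")
```

This avoids the cartesian join, so it should scale better than the melt/dcast route when there are many columns.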