I am trying to compare sets of variables (X) that are stored in two dataframes (foo, bar). Each X is a unique independent variable that has up to 10 values of Y associated with it. I would like to compare every foo.X with every bar.X by counting the number of Y values they have in common, so the output could be a matrix with one row per foo.x and one column per bar.x.
This simple example of foo and bar should return a 2x2 matrix comparing a,b with c,d:
foo <- data.frame(x= c('a', 'a', 'a', 'b', 'b', 'b'), y=c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x= c('c', 'c', 'c', 'd', 'd', 'd'), y=c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
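To make the desired output concrete, here is a minimal base R sketch of the comparison for this example. The names foo.sets, bar.sets and overlap are only illustrative, and it assumes that "in common" means distinct Y values shared by the two groups.

```r
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'),
                  stringsAsFactors = FALSE)
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
                  y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'),
                  stringsAsFactors = FALSE)

# One character vector of y values per x group
foo.sets <- split(foo$y, foo$x)
bar.sets <- split(bar$y, bar$x)

# For each bar group (columns), count the shared y values with each foo group (rows)
overlap <- sapply(bar.sets, function(b) {
  sapply(foo.sets, function(a) length(intersect(a, b)))
})

print(overlap)
```

Here a and c share only 'ab', and b and d share only 'fx', so the matrix has 1 in those two cells and 0 elsewhere. Note that sapply simplifies to a plain vector when foo has only a single group, so this sketch would need adjusting for that case.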
EDIT:
I've left the following code for other newbies to learn from (for loops are effective but probably very suboptimal), but the two solutions below are effective. In particular, Ramnath's use of data.table is very effective when dealing with very large dataframes.
Store the dataframes as lists where the values of y are stored using the stack function:
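library(plyr)  # provides dlply() and .()
# one stacked data frame per value of x; its first column ("values") holds the y values
foo.list <- dlply(foo, .(x), function(x) stack(x, select = y))
bar.list <- dlply(bar, .(x), function(x) stack(x, select = y))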
Write a function for comparing membership in two of the stacked data frames:
comparelists <- function(list1, list2) {
  # column 1 of each stacked data frame holds the y values
  values1 <- list1[[1]]
  values2 <- list2[[1]]
  # count the distinct y values that appear in both
  return(length(intersect(values1, values2)))
}
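Write an output matrix:
# rows are the foo.x groups, columns are the bar.x groups
output.matrix <- matrix(0, nrow = length(foo.list), ncol = length(bar.list),
                        dimnames = list(names(foo.list), names(bar.list)))
# loop over group names so they can be used as matrix indices
for (i in names(foo.list)) {
  for (j in names(bar.list)) {
    output.matrix[i, j] <- comparelists(foo.list[[i]], bar.list[[j]])
  }
}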
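The nested loops above can also be replaced by a join on y followed by a cross-tabulation. The following is a base R sketch of that idea using merge and table; it is not the data.table solution mentioned in the edit, and the names shared and overlap are only illustrative.

```r
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'),
                  stringsAsFactors = FALSE)
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
                  y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'),
                  stringsAsFactors = FALSE)

# Inner join on y: one row per (foo x, bar x) pair that shares a y value.
# unique() drops duplicated (x, y) rows so each shared value is counted once.
shared <- merge(unique(foo), unique(bar), by = "y",
                suffixes = c(".foo", ".bar"))

# Cross-tabulate the pairs; explicit factor levels keep groups with no matches
overlap <- table(foo.x = factor(shared$x.foo, levels = unique(foo$x)),
                 bar.x = factor(shared$x.bar, levels = unique(bar$x)))

print(overlap)
```

Because the join only materialises pairs that actually share a value, this avoids looping over every combination of groups, although the merged table can itself become large when many groups share the same y.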