13
votes

Is there a fast way of finding which rows in matrix A are present in matrix B? e.g.

m1 = matrix(c(1:6), ncol=2, byrow=TRUE); m2 = matrix(c(1:4), ncol=2, byrow=TRUE);

and the result would be 1, 2, because the first two rows of m1 also appear in m2.

The matrices have the same number of columns but not the same number of rows, and they are fairly large: between 10^6 and 10^7 rows.

The fastest way I know of so far is:

duplicated(rbind(m1, m2))

Thanks!

2
Your solution with duplicated would also return any rows that are repeated within one matrix, even if they appear in only one of the two matrices. Anyway, @MatthewDowle's answer is great. - David Robinson
data.table might be faster because it doesn't use do.call("paste" under the hood. If you prefer duplicated to M2[M1] then duplicated(as.data.table(rbind(m1,m2))) might be faster, for the same reason. Interested to see your timings. - Matt Dowle
@David Oh yes, good point about the duplicated approach. - Matt Dowle
Duplicate? stackoverflow.com/questions/7943695/matrix-in-matrix (or, at least, 'Look here for other options!') - Matt Parker
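
The point raised in the comments about duplicated can be illustrated with a small base R sketch. The matrices below are made-up examples, and row_key is a hypothetical helper (not from the thread) that pastes each row into one string; this is the slow paste-based approach the comments mention, shown only to make the difference visible, not as a recommendation for 10^6 rows.

```r
# m1 contains the row (1, 2) twice; m2 does not contain it at all.
m1 <- matrix(c(1, 2,
               1, 2,
               5, 6), ncol = 2, byrow = TRUE)
m2 <- matrix(c(3, 4), ncol = 2, byrow = TRUE)

# Stack m2 first, then keep only the flags that belong to the rows of m1.
dup_flags  <- duplicated(rbind(m2, m1))[-seq_len(nrow(m2))]
false_hits <- which(dup_flags)   # rows of m1 flagged by duplicated()

# Hypothetical helper: build one string key per row.
row_key <- function(m) do.call(paste, c(as.data.frame(m), sep = "\r"))

# Compare keys across the two matrices only, so repeats inside m1 do not count.
true_hits <- which(row_key(m1) %in% row_key(m2))

print(false_hits)
print(true_hits)
```

Here duplicated reports a row of m1 as a hit even though no row of m1 occurs in m2, while the key comparison reports none.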

2 Answers

23
votes

A fast way for that size should be:

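library(data.table)
# setkey() without column names keys (and sorts) each table on all its columns
M1 = setkey(data.table(m1))
M2 = setkey(data.table(m2))
# For every row of M1: the row number of its match in the keyed M2, NA if none
na.omit(
    M2[M1,which=TRUE]
)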
[1] 1 2
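
If the row numbers should refer to the original, unsorted m1, one possible variant is an ad hoc join with on= instead of keys, which leaves both tables in their original order. This is only a sketch under assumptions: it needs a data.table version that supports on= (1.9.6 or later), it relies on both tables getting the default column names V1, V2, and the variable names hit and rows_in_m2 are made up for illustration.

```r
library(data.table)

m1 <- matrix(1:6, ncol = 2, byrow = TRUE)
m2 <- matrix(1:4, ncol = 2, byrow = TRUE)

# Unnamed matrix columns become V1, V2 in both tables.
M1 <- as.data.table(m1)
M2 <- as.data.table(m2)

# For each row of M1 (in its original order), look up the first matching
# row of M2 on all columns; the result is NA where M2 has no such row.
hit <- M2[M1, on = names(M1), which = TRUE, mult = "first"]

# Rows of m1 that are present in m2.
rows_in_m2 <- which(!is.na(hit))
print(rows_in_m2)
```

Using mult = "first" keeps exactly one result per row of M1 even when M2 contains repeated rows.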
-1
votes

I wrote this function, which returns the original row IDs. For example, if you match matrix x against matrix y, it returns the IDs of the matching rows of y.

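rowiseMatch2 <- function(x,y){
  require(data.table)
  # x and y must be matrices with the same column names
  keycols <- colnames(x)
  x <- cbind(x, id=seq_len(nrow(x)))
  y <- cbind(y, id=seq_len(nrow(y)))
  m1 = data.table(x)
  setkeyv(m1, keycols)
  m2 = data.table(y)
  setkeyv(m2, keycols)
  # original row numbers, now in keyed (sorted) order
  m1id <- m1$id
  m2id <- m2$id

  m1$id <- NULL
  m2$id <- NULL

  # one entry per row of m1: first matching row of m2, NA if there is none
  m <- m2[m1,which=TRUE,mult="first"]
  # map to the original row IDs of y and restore the row order of x
  mo <- m2id[m][order(m1id)]
  # drop the rows of x that have no match in y
  mo <- mo[!is.na(mo)]

  if(length(mo) == nrow(x)){
    cat("Complete match!\n")
  }else{
    cat("Incomplete match, match percentage is:", round(length(mo)/nrow(x), 4)*100, "%\n")
  }
  return(as.integer(mo))
}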