1
votes

I have two data frames, df1 and df2, with N rows and M columns each, all numeric. I would like to find rows that are partially identical between df1 and df2. A row will be considered "partially identical" if it identical in at least M/2 columns between df1 and df2.

For example,

df1 = data_frame(A = c(1,2,3,4), B = c(5,6,7,8), C = c(9, 10, 11,12), D = c(13,14,15,16))
df2 = data_frame(A = c(1,2,3,4), B = c(15, 16, 17, 18), C = c(9, 10, 100, 200), D = c(500, 600, 700, 800))

df1
# A tibble: 4 x 4
      A     B     C     D
    <dbl> <dbl> <dbl> <dbl>
1     1     5     9    13
2     2     6    10    14
3     3     7    11    15
4     4     8    12    16

df2
# A tibble: 4 x 4
      A     B     C     D
    <dbl> <dbl> <dbl> <dbl>
1     1    15     9   500
2     2    16    10   600
3     3    17   100   700
4     4    18   200   800

In this example, row 1 in df1 is partially similar to row 1 in df2, and the same for row 2.

Please note that the similar rows can have different indicies between the two data frames, so e.g. row 1 in df1 can be similar to row 1500 in df2 and so on, which makes it very computationally expensive (O(N^2)) to compare one-to-one.

My main problem is that the similarity can be in any M/2 columns, so it seems like the computational effort is choosing M/2 out of M and joining each time, and this is very expensive for large data sets.

I would really appreciate a dplyr solution which would be scalable for a large number of rows (a few millions) and a few hundred columns.

1

1 Answers

1
votes

Try this code:

rowSums(df1==df2)/ncol(df1)>=0.5
[1]  TRUE  TRUE FALSE FALSE

The output is a TRUE (similar) or FALSE (not similar)

If you want compare all columns permutations in df2, try this:

#All the permutation of df2 columns
require("gtools")
permut <- permutations(n = ncol(df2), r = ncol(df2), v = rep(1:ncol(df2)))

permut<-split(permut, seq(nrow(permut)))

#Compare all the permutatin function
f<-function(permut,df1,df2)
{

   out<-rowSums(df1==df2[,unlist(permut)])/ncol(df1)>=0.5
   return(out)
}

#Apply comparison  


    as.logical(apply((matrix(unlist(lapply(permut,f,df1=df1,df2=df2)),nrow=ncol(df2))),1,FUN=max))
[1]  TRUE  TRUE FALSE FALSE