I have two data frames, df1 and df2, with N rows and M columns each, all numeric. I would like to find rows that are partially identical between df1 and df2. A row will be considered "partially identical" if it identical in at least M/2 columns between df1 and df2.
For example,
df1 = data_frame(A = c(1,2,3,4), B = c(5,6,7,8), C = c(9, 10, 11,12), D = c(13,14,15,16))
df2 = data_frame(A = c(1,2,3,4), B = c(15, 16, 17, 18), C = c(9, 10, 100, 200), D = c(500, 600, 700, 800))
df1
# A tibble: 4 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
4 4 8 12 16
df2
# A tibble: 4 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 15 9 500
2 2 16 10 600
3 3 17 100 700
4 4 18 200 800
In this example, row 1 in df1 is partially similar to row 1 in df2, and the same for row 2.
Please note that the similar rows can have different indicies between the two data frames, so e.g. row 1 in df1 can be similar to row 1500 in df2 and so on, which makes it very computationally expensive (O(N^2)) to compare one-to-one.
My main problem is that the similarity can be in any M/2 columns, so it seems like the computational effort is choosing M/2 out of M and joining each time, and this is very expensive for large data sets.
I would really appreciate a dplyr solution which would be scalable for a large number of rows (a few millions) and a few hundred columns.