In R, discover rows which partially match rows in another data frame

Question

I have the following two data frames:

> df1
# A tibble: 4 x 4
    x     y     z     w
  <dbl> <dbl> <dbl> <dbl>
    4     5     8     9
    4     6     7     4
    3     6     7    10
    8     2     8     9
> df2
# A tibble: 4 x 4
    x     y     z     w
  <dbl> <dbl> <dbl> <dbl>
    6     2     7     9
    2     6     7    10
    4     5     8    12
    4     5     8     3

I would like to discover which rows in df2 have a match in df1, where a match means being identical in at least n/2 columns (n being the number of columns).

So in this example, row 1 in df2 matches row 4 in df1 (columns 2 and 4); row 2 in df2 matches row 2 in df1 on columns 2 and 3, and row 3 in df1 on columns 2, 3, and 4; and so on.

I also have to save the location of the repeating rows and the columns on which they match.

For small data sets, I could replicate both data sets, subtract them, and count the zeros. However, what I need is a solution that works on very large data sets (~20K rows).
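
A minimal sketch of that brute-force idea in base R, for illustration only. The names m1, m2, idx, n_matched, and matches are assumptions of this sketch, and the data are rebuilt with data.frame so the block runs on its own:

```r
# Example data from the question, rebuilt as plain data frames
df1 <- data.frame(x = c(4, 4, 3, 8), y = c(5, 6, 6, 2),
                  z = c(8, 7, 7, 8), w = c(9, 4, 10, 9))
df2 <- data.frame(x = c(6, 2, 4, 4), y = c(2, 6, 5, 5),
                  z = c(7, 7, 8, 8), w = c(9, 10, 12, 3))

m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# Pair every row of df1 with every row of df2 (4 x 4 = 16 pairs here)
idx <- expand.grid(df1_id = seq_len(nrow(m1)),
                   df2_id = seq_len(nrow(m2)))

# Replicate the rows according to the pairing and subtract;
# a zero marks a column where the two rows agree
diffs <- m2[idx$df2_id, , drop = FALSE] - m1[idx$df1_id, , drop = FALSE]
idx$n_matched <- rowSums(diffs == 0)

# Keep the pairs that agree on at least n/2 columns
matches <- idx[idx$n_matched >= ncol(m1) / 2, ]
matches
```

With 20K rows in each data frame, idx would hold 400 million row pairs, which is why this approach does not scale.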

Any ideas? A dplyr solution (rather than a data.table one) would be highly appreciated.

What exactly do you mean by "For small data sets, I could replicate both data sets and subtract them and count the zeros"? — MKR
I could replicate each dataset 4 times, so that each row in df1 will be compared to each row in df2. — Omry Atia
That's what I had guessed. In what format do you expect the result to be? — MKR
df1 will now be a 16x4 data frame, and so will df2. If I now subtract them I will have a 16x4 data frame of zeros and non-zeros, and the location of these zeros will indicate which rows are matching — Omry Atia
Have a look at the answer. If you want only those rows that have more than 50% matching columns, we can modify the function to return only the rows meeting that criterion. — MKR

sbha · Accepted Answer · 2018-04-03T16:44:22

This final output might not be the ideal format, but it should at least have the information you're looking for and work with many more fields/columns.

df1 <- read.table(text =
             "x     y     z     w
              4     5     8     9
              4     6     7     4
              3     6     7    10
              8     2     8     9",
              header = TRUE)

df2 <- read.table(text =
             "x     y     z     w
              6     2     7     9
              2     6     7    10
              4     5     8    12
              4     5     8     3",
              header = TRUE)


library(dplyr)
library(tidyr)

Add a row ID number to each data frame and reshape the data from wide to long with gather. (I'm assuming each row can be treated as a unique id):

df1 <- df1 %>% 
  mutate(df1_id = row_number()) %>%
  gather(field, value, x:w) %>% 
  arrange(df1_id)

df2 <- df2 %>% 
  mutate(df2_id = row_number()) %>% 
  gather(field, value, x:w) %>% 
  arrange(df2_id)
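
In tidyr 1.0.0 and later, gather is superseded by pivot_longer. A minimal sketch of the same reshaping step, shown for df1 only; the name long1 is an illustrative choice and not part of the original answer:

```r
library(dplyr)
library(tidyr)

df1 <- read.table(text =
             "x     y     z     w
              4     5     8     9
              4     6     7     4
              3     6     7    10
              8     2     8     9",
              header = TRUE)

# One output row per (row id, column) pair; every column except
# the id is stacked into field/value
long1 <- df1 %>%
  mutate(df1_id = row_number()) %>%
  pivot_longer(-df1_id, names_to = "field", values_to = "value")

long1
```

pivot_longer emits the cells of each input row together, so the result is already ordered by df1_id and the arrange step is not needed.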

Join the two data frames with an inner_join on field/column and value. Then use group_by and filter to keep only the df2/df1 row pairs that match on two or more fields:

df2 %>% 
  inner_join(df1, by = c('value', 'field')) %>%
  group_by(df2_id, df1_id) %>% 
  filter(n()>=2) %>%  # where 2 is the minimum number of matches
  arrange(df2_id, df1_id, value) %>% 
  select(df2_id, df1_id, field, value)

# A tibble: 13 x 4
# Groups:   df2_id, df1_id [5]
   df2_id df1_id field value
    <int>  <int> <chr> <int>
 1      1      4 y         2
 2      1      4 w         9
 3      2      2 y         6
 4      2      2 z         7
 5      2      3 y         6
 6      2      3 z         7
 7      2      3 w        10
 8      3      1 x         4
 9      3      1 y         5
10      3      1 z         8
11      4      1 x         4
12      4      1 y         5
13      4      1 z         8

You can see that df2 row 1 matches df1 row 4 on the fields y and w, and df2 row 2 matches df1 row 2 on fields y and z. df2 row 2 also matches df1 row 3 on fields y, z, and w. df2 rows 3 and 4 match df1 row 1 on x, y, and z.

arrange and select are really only necessary for easier viewing of the data.
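
If one row per matching pair is more convenient, the joined result can be collapsed with summarise. This is a minimal sketch rather than part of the original answer: the names long1, long2, min_matches, n_matched, matched_cols, and pairs are illustrative, and it assumes "at least n/2 columns" means ceiling(n/2).

```r
library(dplyr)
library(tidyr)

df1 <- read.table(text = "x y z w
4 5 8 9
4 6 7 4
3 6 7 10
8 2 8 9", header = TRUE)

df2 <- read.table(text = "x y z w
6 2 7 9
2 6 7 10
4 5 8 12
4 5 8 3", header = TRUE)

# Assumption: "at least n/2 columns" is read as ceiling(n / 2)
min_matches <- ceiling(ncol(df1) / 2)

# Reshape both data frames to long form, keeping a row id
long1 <- df1 %>%
  mutate(df1_id = row_number()) %>%
  gather(field, value, -df1_id)

long2 <- df2 %>%
  mutate(df2_id = row_number()) %>%
  gather(field, value, -df2_id)

# One row per (df2 row, df1 row) pair, with the number of matching
# columns and their names collapsed into a single string
pairs <- long2 %>%
  inner_join(long1, by = c("field", "value")) %>%
  group_by(df2_id, df1_id) %>%
  summarise(n_matched = n(),
            matched_cols = paste(field, collapse = ",")) %>%
  ungroup() %>%
  filter(n_matched >= min_matches)

pairs
```

For the example data this gives five pairs, for instance df2 row 2 with df1 row 3 on "y,z,w". Recent dplyr versions may warn about a many-to-many join here; that is expected, because the same value can occur several times in a column of either data frame.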
