Matching rows of two dataframes based on multiple conditions

Question

For example, if I have two data frames df1 and df2, which both have 3 columns with column names c("Name", "Region", "State") and look like following:

df1:

Name  Region             State
A     Boston City        Mass
B     Washington D.C.    NA
C     New York           NY

df2:

Name  Region             Job
C     Boston             Massachusetts
B     D.C.               NA
A     Boston             Massachusetts
D     Dallas             Texas

The thing I want to do is to see whether the row in df1 has a corresponding row in df2 by precisely matching "name" but only partially matching "Region" and "Job". For example, "Boston City" and "Boston" should be treated as same (same for "Washington D.C." and "D.C.") and "Mass" and "Massachusetts" should be treated same as well. Thus, rows with name "A" and "B" in df1 should successfully match for the corresponding rows with name "A" and "B" in df2. What I want for outcome is a data frame with two columns. Column 1 is the row indexes of df1 that has a match in df2 and column 2 is the row indexes of corresponding matched rows in df2. That is,

Column1  Column2
1        3
2        2

I know I can probably use grep to do partial matching? But I still don't know how to achieve my objective. Thanks!

Ronak Shah Ronak Shah · Accepted Answer · 2020-04-10T05:25:37

We can create a row index for each of the data frame and then full_join based on Name column. We then select only those rows where Region partially matches and Job and State partially match. We also select rows when both columns have NA values.

library(dplyr)
library(stringr)

df1 %>%
  mutate(Column1=row_number()) %>%
  full_join(df2 %>% mutate(Column2 = row_number()), by = 'Name') %>%
  filter((str_detect(Region.x, fixed(Region.y)) | 
                   (is.na(Region.x) & is.na(Region.y))) & 
       (str_detect(Job, fixed(State)) | (is.na(Job) & is.na(State)))) %>%
   select(Column1, Column2)

#  Column1 Column2
#1       1       3
#2       2       2

data

df1 <-  structure(list(Name = c("A", "B", "C"), Region = c("BostonCity", 
"WashingtonD.C.", "NewYork"), State = c("Mass", NA, "NY")), row.names = c(NA, 
-3L), class = "data.frame")

df2 <- structure(list(Name = c("C", "B", "A", "D"), Region = c("Boston", 
"D.C.", "Boston", "Dallas"), Job = c("Massachusetts", NA, "Massachusetts", 
"Texas")), row.names = c(NA, -4L), class = "data.frame")

Matching rows of two dataframes based on multiple conditions

1 Answers