Compare and merge two dataframes

Question

I have the following two dataframes in R:

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5))
colnames(df1) = c("X", "Y", "Z", "score")

df1 
  X  Y  Z score
1 A  1  6     1
2 A 11 20     2
3 A 21 30     3
4 B 35 40     4
5 B 45 60     5

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))
colnames(df2) = c("X", "Y", "Z", "out")

df2
  X  Y  Z out
1 A  1  5  x1
2 A  6 20  x2
3 A 21 30  x3
4 A 50 60  x4 
5 B 20 30  x5
6 B 31 40  x6
7 B 50 60  x7
8 C 10 20  x8

For every row in df1, I want to check:

is there a match with the value in 'X' and any other 'X' value from df2
if the above is true: I want to check if the values from 'Y' and 'Z' are in the range of the values 'Y' and 'Z' from df2
if both are true: then I want to add the value from 'out' to df1.

This is how the output should look like:

output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x4", "x5"))
colnames(output) = c("X", "Y", "Z", "score", "out")

  X  Y  Z score    out
1 A  1  6     1 x1, x2
2 A 11 20     2     x2
3 A 21 30     3     x3
4 B 35 40     4     x6
5 B 45 60     5     x7

The original df1 is kept with an extra column 'out' that is added.

Line 1 from 'output', contains 'x1, x2' in column 'out'. Why: there is a match between the values in column 'X' and range 1 to 6 overlap with lines 1 and 2 from df2.

I've asked this question before (Compare values from two dataframes and merge) where it is suggested to use the foverlaps function. However because of the different columns between df1 and df2 and the extra rows in df2, I cannot make it work.

AntoniosK AntoniosK · Accepted Answer · 2017-02-16T12:59:24

library(dplyr)

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), 
                 c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = F)
colnames(df1) = c("X", "Y", "Z", "score")

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), 
                 c(5, 20, 30, 60, 30, 40, 60, 20), 
                 c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = F)
colnames(df2) = c("X", "Y", "Z", "out")


df1 %>%
  left_join(df2, by="X") %>%         # join on main column
  rowwise() %>%                      # for each row
  mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>%   # get how many elements of those ranges overlap
  filter(counter > 0) %>%            # keep rows with overlap
  group_by(X, Y.x, Z.x, score) %>%   # for each combination of those columns
  summarise(out = paste(out, collapse=", ")) %>%                # combine out column
  ungroup() %>%
  rename(Y = Y.x,
         Z = Z.x)

# # A tibble: 5 × 5
#       X     Y     Z score    out
#    <chr> <dbl> <dbl> <dbl> <chr>
# 1     A     1     6     1 x1, x2
# 2     A    11    20     2     x2
# 3     A    21    30     3     x3
# 4     B    35    40     4     x6
# 5     B    45    60     5     x7

The above process is based on dplyr package and involves a join and some grouping and filtering. If your initial datasets (df1, df2) are extremely large then the join will create an even bigger dataset that will need some time to be created.

Also, note that this process works with character and not factor variables. The process might convert factor variables to character if it tries to join factor variables with different levels.

I'd suggest you run the chained commands step by step to see how it works and spot if I missed anything that might lead to bugs in the code.

Compare and merge two dataframes

3 Answers