1
votes

I have the following two dataframes in R:

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5))
colnames(df1) = c("X", "Y", "Z", "score")

df1 
  X  Y  Z score
1 A  1  6     1
2 A 11 20     2
3 A 21 30     3
4 B 35 40     4
5 B 45 60     5

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))
colnames(df2) = c("X", "Y", "Z", "out")

df2
  X  Y  Z out
1 A  1  5  x1
2 A  6 20  x2
3 A 21 30  x3
4 A 50 60  x4 
5 B 20 30  x5
6 B 31 40  x6
7 B 50 60  x7
8 C 10 20  x8

For every row in df1, I want to check:

  • is there a match with the value in 'X' and any other 'X' value from df2
  • if the above is true: I want to check if the values from 'Y' and 'Z' are in the range of the values 'Y' and 'Z' from df2
  • if both are true: then I want to add the value from 'out' to df1.

This is how the output should look like:

output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x4", "x5"))
colnames(output) = c("X", "Y", "Z", "score", "out")

  X  Y  Z score    out
1 A  1  6     1 x1, x2
2 A 11 20     2     x2
3 A 21 30     3     x3
4 B 35 40     4     x6
5 B 45 60     5     x7

The original df1 is kept with an extra column 'out' that is added.

Line 1 from 'output', contains 'x1, x2' in column 'out'. Why: there is a match between the values in column 'X' and range 1 to 6 overlap with lines 1 and 2 from df2.

I've asked this question before (Compare values from two dataframes and merge) where it is suggested to use the foverlaps function. However because of the different columns between df1 and df2 and the extra rows in df2, I cannot make it work.

3

3 Answers

1
votes
library(dplyr)

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), 
                 c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = F)
colnames(df1) = c("X", "Y", "Z", "score")

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), 
                 c(5, 20, 30, 60, 30, 40, 60, 20), 
                 c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = F)
colnames(df2) = c("X", "Y", "Z", "out")


df1 %>%
  left_join(df2, by="X") %>%         # join on main column
  rowwise() %>%                      # for each row
  mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>%   # get how many elements of those ranges overlap
  filter(counter > 0) %>%            # keep rows with overlap
  group_by(X, Y.x, Z.x, score) %>%   # for each combination of those columns
  summarise(out = paste(out, collapse=", ")) %>%                # combine out column
  ungroup() %>%
  rename(Y = Y.x,
         Z = Z.x)

# # A tibble: 5 × 5
#       X     Y     Z score    out
#    <chr> <dbl> <dbl> <dbl> <chr>
# 1     A     1     6     1 x1, x2
# 2     A    11    20     2     x2
# 3     A    21    30     3     x3
# 4     B    35    40     4     x6
# 5     B    45    60     5     x7

The above process is based on dplyr package and involves a join and some grouping and filtering. If your initial datasets (df1, df2) are extremely large then the join will create an even bigger dataset that will need some time to be created.

Also, note that this process works with character and not factor variables. The process might convert factor variables to character if it tries to join factor variables with different levels.

I'd suggest you run the chained commands step by step to see how it works and spot if I missed anything that might lead to bugs in the code.

2
votes

Here are two possible ways, a) using the newly implemented non equi joins feature, and b) foverlaps as you'd specifically mentioned that..

a) non-equi joins

dt2[dt1, on=.(X, Z>=Y, Y<=Z), 
      .(score, out=paste(out, collapse=",")), 
    by=.EACHI]

where dt1 and dt2 are data.tables corresponding to df1 and df2. Note that you'll have to revert column names Z and Y in the result (since the column names come from dt2 but the values from dt1.

Matching rows from dt2 corresponding to each row is dt1 is found based on the condition provided to the on argument and .() is evaluated for each of those matching rows (because of by=.EACHI).

b) foverlaps

setkey(dt1, X, Y, Z)
olaps <- foverlaps(dt2, dt1, type="any", nomatch=0L)
olaps[, .(score=score[1L], out=paste(out, collapse=",")), by=.(X,Y,Z)]
0
votes

Here is another options using sqldf

library(sqldf)
xx=sqldf('select t1.*,t2.out from df1 t1 left join df2 t2 on t1.X=t2.X and ((t2.Y between t1.Y and t1.Z) or (t2.Z between t1.Y and t1.Z))')
aggregate(xx[ncol(xx)], xx[-ncol(xx)], FUN = function(X) paste(unique(X), collapse=", "))