1
votes

I'm working with two dataframes in R:

df1 = data.frame(c("A", "B"), c(1, 21), c(17, 29))
colnames(df1) = c("location", "start", "stop")

df1
location    start    stop
A           1        17
B           21       29

df2 = data.frame(c("A", "A", "A", "A", "B"), c(1, 10, 20, 40, 20), c(10, 20, 30, 50, 30), c("x1", "x2","x4", "x5", "x3"))
colnames(df2) = c("location", "start", "stop", "out")

df2
location    start    stop   out
A           1        10     x1
A           10       20     x2
A           20       30     x4
A           40       50     x5  
B           20       30     x3

Now I want to check for each row of df1:

  • is there a match between 'location' with a 'location' from df2
  • if the 'start' value is in the range of start and stop from df2 or if the 'end' value is in the range of start and stop from df2, then the corresponding 'out' value from df2 should be pasted in a new column in df1

This is how the output would look in the case of this example

df1_new

location    start    stop    out
A           1        17      x1,x2
B           21       29      x3

I've started in R, but I'm stuck at the point where I need to look in the complete dataframe of df2

for (i in nrow(df1)) {
   if(df1$location[i] == df2$location # it needs to look for a match in the complete dataframe of df2. I don't know how to do this
   & if (df1$start[i] %in% # it needs to check if the start value lies in the range between df2$start & df2$end
}
2
Your edit is very incomplete, you added rows to the printout, but not to the code making it. You also neglected to update the output.Frank
@Frank: I've forgotten to change the code, this is done now. The output stays the sameuser1987607
Sorry, no, your input now contains an error with quote marks. Try running your code before posting it.Frank

2 Answers

2
votes

Here's a data.table way, using foverlaps:

library(data.table)
setkey(setDT(df1))
setDT(df2, key = names(df1))

foverlaps(df1, df2)[, .(out = toString(out)), by=location]

#    location    out
# 1:        A x1, x2
# 2:        B     x3

You can get other cols out of the foverlaps results if desired:

foverlaps(df1, df2)
#    location start stop out i.start i.stop
# 1:        A     1   10  x1       1     17
# 2:        A    10   20  x2       1     17
# 3:        B    20   30  x3      21     29
2
votes

You need to aggregate first and then merge, i.e.

merge(df1, aggregate(out ~ location, df2, toString), by = 'location')

#  location start stop    out
#1        A     1   17 x1, x2
#2        B    21   29     x3