1
votes

I have 2 R dataframes, each with more than a million rows, of genomic coordinates and need to find out which one overlap (by any amount). Therefore, I need to know the rows where df1.chr = df2.chr and where df1.start (or df1.end) is in between df2.start and df2.end or where df2.start (or df2.end) is in between df1.start and df1.end. If this can make the algorithm easier, I really only need to how many overlap, not necessarily which one.

Example dataframe 1

df1 <- data.frame("chr" = c("chr1", "chr1", "chr2", "chr3", "chr3"), 
                  "start" = c(32826450, 49416884, 49417450, 49417523, 144385631), 
                  "end" = c(32826456, 49416890, 49417456, 49417529, 144385637))

Example dataframe 2

df2 <- data.frame("chr" = c("chr1", "chr1", "chr2"), 
                  "start" = c(32826455, 45259551, 49417440), 
                  "end" = c(32826458, 45259557, 49417453))

Using sqldf I could easily obtain the answer with

sqldf("SELECT *
      FROM df1
      JOIN df2
      ON
      df1.chr = df2.chr and
      ((df1.start BETWEEN df2.start and df2.end or
        df1.end BETWEEN df2.start and df2.end) or
       (df2.start BETWEEN df1.start and df1.end or
        df2.end BETWEEN df1.start and df1.end))")

to get

   chr    start      end chr..4 start..5   end..6
1 chr1 32826450 32826456   chr1 32826455 32826458
2 chr2 49417450 49417456   chr2 49417440 49417453

But this take hours to complete.

Is there a more efficient way to proceed? I tried looking at both dplyr and data.table but could never get all parts to work.

Thanks!

2
You only need the df1.start between ... or df2.start between ... conditions since if neither start is within the other's range they must be disjoint.G. Grothendieck

2 Answers

1
votes

It may be much easier to use the bioconductor library GenomicRanges for this task.

df1 <- makeGRangesFromDataFrame(df1)
df2 <- makeGRangesFromDataFrame(df2)

> sum(countOverlaps(df1, df2))
[1] 2

HelloRanges is a great resource on how to perform common tasks on GRanges.

0
votes

Maybe something like:

library(data.table)
setDT(df1)
setDT(df2)
cols <- c("chr","i.start","i.end","x.start","x.end")
unique(rbindlist(list(
    df2[df1, on=.(chr, start<=start, end>=start), nomatch=0L, mget(cols)],
    df2[df1, on=.(chr, start<=end, end>=end), nomatch=0L, mget(cols)],
    df2[df1, on=.(chr, start<=start, end>=start), nomatch=0L, mget(cols)],
    df2[df1, on=.(chr, start<=end, end>=end), nomatch=0L, mget(cols)]
)))

output:

    chr  i.start    i.end  x.start    x.end
1: chr2 49417450 49417456 49417440 49417453
2: chr1 32826450 32826456 32826455 32826458