I have 2 R dataframes, each with more than a million rows, of genomic coordinates and need to find out which one overlap (by any amount). Therefore, I need to know the rows where df1.chr = df2.chr and where df1.start (or df1.end) is in between df2.start and df2.end or where df2.start (or df2.end) is in between df1.start and df1.end. If this can make the algorithm easier, I really only need to how many overlap, not necessarily which one.
Example dataframe 1
df1 <- data.frame("chr" = c("chr1", "chr1", "chr2", "chr3", "chr3"),
"start" = c(32826450, 49416884, 49417450, 49417523, 144385631),
"end" = c(32826456, 49416890, 49417456, 49417529, 144385637))
Example dataframe 2
df2 <- data.frame("chr" = c("chr1", "chr1", "chr2"),
"start" = c(32826455, 45259551, 49417440),
"end" = c(32826458, 45259557, 49417453))
Using sqldf I could easily obtain the answer with
sqldf("SELECT *
FROM df1
JOIN df2
ON
df1.chr = df2.chr and
((df1.start BETWEEN df2.start and df2.end or
df1.end BETWEEN df2.start and df2.end) or
(df2.start BETWEEN df1.start and df1.end or
df2.end BETWEEN df1.start and df1.end))")
to get
chr start end chr..4 start..5 end..6
1 chr1 32826450 32826456 chr1 32826455 32826458
2 chr2 49417450 49417456 chr2 49417440 49417453
But this take hours to complete.
Is there a more efficient way to proceed? I tried looking at both dplyr and data.table but could never get all parts to work.
Thanks!