Collapse intersecting regions

Question

I am trying to find a way to collapse rows with intersecting ranges, denoted by "start" and "stop" columns, and record the collapsed values into new columns. For example I have this data frame:

my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), name=c("a","b","c","d","e","f","g"), start=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), stop=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)))


chrom name  start   stop
 1    a        0    71200
 1    b    70001    71200
 1    c    70203    80001
 1    d    70060    71051
14    e    40004    42004
16    f 50000872 50000890
16    g 50000872 51000952

And I am trying to find the overlapping ranges and record the biggest range covered by the collapsed overlapping rows in "start" and "stop" and the names of the collapsed rows, so I would get this:

chrom start   stop      name
 1    70001    80001    a,b,c,d
14    40004    42004    e
16    50000872 51000952 f,g

I think I could use the packages IRanges like this:

library(IRanges)
ranges <- split(IRanges(my.df$start, my.df$stop), my.df$chrom)

But then I have trouble getting the collapsed columns: I have tried with findOvarlaps but this

ov <- findOverlaps(ranges, ranges, type="any")

but I don't think this is right.

Any help would be extremely appreciated.

I edited the text to reflect the data better by adding the first position at start 0. With either of the approaches suggested chrom 14 is not correctly grouped, can you please help me understand why? Thank you! — user971102

agstudy agstudy · Accepted Answer · 2013-06-06T09:19:06

IRanges is a good candidate for such job. No need to use chrom variable.

ir <- IRanges(my.df$start, my.df$stop)
## I create a new grouping variable Note the use of reduce here(performance issue)
my.df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))
# chrom name    start     stop group2
# 1     1    a    70001    71200      2
# 2     1    b    70203    80001      2
# 3     1    c    70060    71051      2
# 4    14    d    40004    42004      1
# 5    16    e 50000872 50000890      3
# 6    16    f 50000872 51000952      3

The new group2 variable is the range indicator. Now using data.table I can't transform my data to the desired output:

library(data.table)
DT <- as.data.table(my.df)
DT[, list(start=min(start),stop=max(stop),
         name=list(name),chrom=unique(chrom)),
               by=group2]

# group2    start     stop  name chrom
# 1:      2    70001    80001 a,b,c     1
# 2:      1    40004    42004     d    14
# 3:      3 50000872 51000952   e,f    16

PS: the collapsed variable name here is not string but a list of factor. This is more efficient and easier to access than a collapased character using paste for example.

EDIT after OP clarification, I will create the group variable by chrom. I mean the Iranges code now is called for each chrom group. I slightly modify your data, to create group of intervals the same chromosome.

my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), 
                   name=c("a","b","c","d","e","f","g"),
                   start=as.numeric(c(0,3000,70203,70060, 40004, 50000872, 50000872)), 
                   stop=as.numeric(c(1,5000,80001,71051, 42004, 50000890, 51000952)))

library(data.table)
DT <- as.data.table(my.df)

## find interval for each chromsom
DT[,group := { 
      ir <-  IRanges(start, stop);
       subjectHits(findOverlaps(ir, reduce(ir)))
      },by=chrom]

## Now I group by group and chrom 
DT[, list(start=min(start),stop=max(stop),name=list(name),chrom=unique(chrom)),
   by=list(group,chrom)]

  group chrom    start     stop name chrom
1:     1     1        0        1    a     1
2:     2     1     3000     5000    b     1
3:     3     1    70060    80001  c,d     1
4:     1    14    40004    42004    e    14
5:     1    16 50000872 51000952  f,g    16

Collapse intersecting regions

2 Answers