6
votes

I have one table with coordinates (start, end) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match with the former fragments. I.e., for each record from dtCoords table I need to search a record in dtFrags table having the same chr and start<=coord<=end (and retrieve the type from this record of dtFrags). Is it good idea at all to use R for this, or I should rather look to other languages?

Here is my example:

require(data.table)

dtFrags <- fread(
  "id,chr,start,end,type
 1,1,100,200,exon
 2,2,300,500,intron
 3,X,400,600,intron
 4,2,250,600,exon
")

dtCoords <- fread(
"id,chr,coord
 10,1,150
 20,2,300
 30,Y,500
")

At the end, I would like to have something like this:

"idC,chr,coord,idF,type
 10,  1,  150,  1, exon
 20,  2,  300,  2, intron
 20,  2,  300,  4, exon
 30,  Y,  500, NA, NA
"

I can simplify a bit the task by splitting the table to subtables by chr, so I would concentrate only on coordinates

setkey(dtCoords, 'chr')
setkey(dtFrags,  'chr')

for (chr in unique(dtCoords$chr)) {
  dtCoordsSub <- dtCoords[chr];
  dtFragsSub  <-  dtFrags[chr];
  dtCoordsSub[, {
    # ????  
  }, by=id]  
}

but it's still not clear for me how should I work inside... I would be very grateful for any hints.

UPD. just in case, I put my real table in the archive here. After unpacking to your working directory, tables can be loaded with the following code:

dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags  <- fread("dtFrags.txt",  sep="\t", header=TRUE)
2
I do not have a way to run R right now, but this problems seems interesting to me. and I think first divide data by chromosome and then I hope you do not have duplicate coord in the range, then make coord column based on start and position.then merge later with that coord. I will get back to the problem tomorrow, if it is not solved by that timeAnanta

2 Answers

7
votes

In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".

require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500
3
votes

Does this work? You can use merge first and then subset

   kk<-merge(dtFrags,dtCoords,by="chr",all.x=TRUE)
> kk
   chr id.x start end   type id.y coord
1:   1    1   100 200   exon   10   150
2:   2    2   300 500 intron   20   300
3:   2    4   250 600   exon   20   300
4:   X    3   400 600 intron   NA    NA


 kk[coord>=start & coord<=end]
   chr id.x start end type id.y coord
1:   1    1   100 200 exon   10   150
2:   2    4   250 600 exon   20   300