I have one table with coordinates (start
, end
) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match with the former fragments. I.e., for each record from dtCoords
table I need to search a record in dtFrags
table having the same chr
and start
<=coord
<=end
(and retrieve the type
from this record of dtFrags
). Is it good idea at all to use R for this, or I should rather look to other languages?
Here is my example:
require(data.table)
dtFrags <- fread(
"id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
")
dtCoords <- fread(
"id,chr,coord
10,1,150
20,2,300
30,Y,500
")
At the end, I would like to have something like this:
"idC,chr,coord,idF,type
10, 1, 150, 1, exon
20, 2, 300, 2, intron
20, 2, 300, 4, exon
30, Y, 500, NA, NA
"
I can simplify a bit the task by splitting the table to subtables by chr
, so I would concentrate only on coordinates
setkey(dtCoords, 'chr')
setkey(dtFrags, 'chr')
for (chr in unique(dtCoords$chr)) {
dtCoordsSub <- dtCoords[chr];
dtFragsSub <- dtFrags[chr];
dtCoordsSub[, {
# ????
}, by=id]
}
but it's still not clear for me how should I work inside... I would be very grateful for any hints.
UPD. just in case, I put my real table in the archive here. After unpacking to your working directory, tables can be loaded with the following code:
dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags <- fread("dtFrags.txt", sep="\t", header=TRUE)