
I have two large data frames (read from CSV files). The first (df1) has this structure:

chromName fragStart fragEnd fragLength leftFragEndLength rightFragEndLength
   Chr1     176         377            202          202                202
   Chr1     472         746            275          275                275
   Chr1     1276        1382            107         107                107
   Chr1     1581        1761            181         173                  4
   Chr1     1890        2080            191          93                 71

The second (df2) holds the results for the 5' targets (5'target_id_start, 5'target_id_end) and the 3' targets (3'target_id_start, 3'target_id_end) together, and looks like this:

    Chr target_id_start target_id_end tot_counts uniq_counts est_counts
1  Chr1        10000016      10000066          0           0          0
2  Chr1        10000062      10000112          0           0          0
3  Chr1        10000171      10000221          0           0          0
4  Chr1        10000347      10000397          0           0          0
5  Chr1         1000041       1000091          0           0          0

What I'm trying to do is check whether target_id_start and target_id_end fall within the interval from fragStart to fragEnd (inclusive). If they do, I want to write the columns tot_counts, uniq_counts and est_counts into the first data frame, df1. This should be done for both the 5' targets (5'target_id_start, 5'target_id_end) and the 3' targets (3'target_id_start, 3'target_id_end), so that the result looks like this:

chromName fragStart fragEnd fragLength leftFragEndLength rightFragEndLength tot_counts5' uniq_counts5' est_counts5' tot_counts3' uniq_counts3' est_counts3'
    Chr1     176         377            202          202                202            0           0          0            0           0          0 
    Chr1     472         746            275          275                275            0           0          0            0           0          0
    Chr1     1276        1382            107         107                107            0           0          0            0           0          0
    Chr1     1581        1761            181         173                  4            0           0          0            0           0          0
    Chr1     1890        2080            191          93                 71            0           0          0            0           0          0

Do you know a good way to do this in R? Thank you very much.

You may want to look at findOverlaps from library(IRanges) or foverlaps from library(data.table). Maybe these links will give you some ideas: stackoverflow.com/questions/27619381/… or stackoverflow.com/questions/19748535/… - akrun
Are both DFs of equal size, and does each row in one correspond to a row in the other? - statespace
No, the DFs are not of equal size, and the rows in one do not correspond to the rows in the other! - user3683485
Perhaps you should simplify your example. Is the question whether some interval of two numerics in df1 matches an interval of two numerics in df2, and then, on every match, to write values from df2 to df1? The clutter of long column names kind of obscures the essence of the question. - statespace
It's something like that. If, for example, fragStart <= target_id_start <= fragEnd, or the same holds for target_id_end, then on every match write the values from df2 to df1. You are right that the numbers and names don't help. Generally, I want the numerics from df1 to be part of, or to match, the interval of the two numerics in df2; when this is true, write the values from df2 to df1. - user3683485

1 Answer


Even though I really hate loops, the best I can offer is:

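# Toy data: a holds the [x, y] intervals to annotate, b holds [x, y] intervals with a label z
a <- data.frame(x = c(1, 10, 100), y = c(2, 20, 200))
b <- data.frame(x = c(1.5, 30, 90, 150), y = c(1.6, 50, 101, 170), z = c("a", "b", "c", "d"))

a$z <- NA  # will hold the row index of the first matching interval in b

for (i in seq_len(nrow(a))) {
  # rows of b whose start or end falls inside interval i of a
  temp <- which((b$x >= a$x[i] & b$x <= a$y[i]) | (b$y >= a$x[i] & b$y <= a$y[i]))
  a$z[i] <- if (length(temp) > 0) temp[1] else NA
}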

As an example, the loop stores, for each interval in a, the row index of the first interval in b whose start or end falls inside it (NA if there is none). From there you can use these row indices to copy the corresponding values from b into another column of a.
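The second step, turning the row indices into values, does not need another loop: indexing b with the index vector copies the values in one go, and NA indices simply yield NA. A minimal sketch on the same toy data; the names idx and label are made up for illustration, and stringsAsFactors = FALSE is set only so that z stays a character column on older R versions:

```r
a <- data.frame(x = c(1, 10, 100), y = c(2, 20, 200))
b <- data.frame(x = c(1.5, 30, 90, 150), y = c(1.6, 50, 101, 170),
                z = c("a", "b", "c", "d"), stringsAsFactors = FALSE)

# For each interval in a: index of the first row of b whose start or end
# lies inside it, or NA if no row of b qualifies
idx <- vapply(seq_len(nrow(a)), function(i) {
  hits <- which((b$x >= a$x[i] & b$x <= a$y[i]) |
                (b$y >= a$x[i] & b$y <= a$y[i]))
  if (length(hits) > 0) hits[1] else NA_integer_
}, integer(1))

# Copy the matching values across; rows without a match get NA
a$label <- b$z[idx]
```

Note that only the first match per interval is kept here; if several rows of b can hit the same interval, you would have to decide how to combine them.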

This might give you some ideas, but it is not efficient on large data sets. I hope it inspires you to find a proper solution rather than a workaround like mine.
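For large inputs, the foverlaps function from data.table that akrun mentioned in the comments is probably closer to a proper solution. The following is only a sketch under several assumptions: the data values are invented, only tot_counts is carried over (the other count columns would be handled the same way), counts from multiple overlapping targets are summed (the question does not say how to combine them), and type = "any" counts every overlap, which is slightly broader than "start or end lies inside the fragment" because it also matches a target that fully contains a fragment.

```r
library(data.table)

# Made-up data using the column names from the question
df1 <- data.table(chromName = "Chr1",
                  fragStart = c(176L, 472L, 1276L),
                  fragEnd   = c(377L, 746L, 1382L))
df2 <- data.table(Chr = "Chr1",
                  target_id_start = c(180L, 300L, 700L, 5000L),
                  target_id_end   = c(230L, 350L, 750L, 5050L),
                  tot_counts      = c(5L, 1L, 2L, 9L))

# foverlaps() needs the second table keyed; the last two key columns are the interval
setkey(df2, Chr, target_id_start, target_id_end)

# which = TRUE returns pairs of row numbers: xid (row of df1), yid (row of keyed df2)
ov <- foverlaps(df1, df2,
                by.x = c("chromName", "fragStart", "fragEnd"),
                by.y = c("Chr", "target_id_start", "target_id_end"),
                type = "any", which = TRUE)
ov <- ov[!is.na(yid)]  # drop fragments without any overlapping target

# Assumption: sum the counts of all targets that hit the same fragment
sums <- ov[, .(tot_counts = sum(df2$tot_counts[yid])), by = xid]

df1[, tot_counts := 0L]                          # fragments with no hit keep 0
df1[sums$xid, tot_counts := sums$tot_counts]     # fill in the matched fragments
```

To get separate 5' and 3' columns, the same steps could be run once per subset of df2, assuming df2 has (or can be given) a column that tells the two kinds of target apart.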