
I have a dataframe "dfA" (65,000 rows) of the form:

Chr Pos     NCP     NCP_Ratio
1   72      1.06    0.599
1   371     4.26    1.331
1   633     2.10    2.442
1   859     1.62    1.276
1   1032    7.62    4.563
1   1199    6.12    4.896
1   1340    13.22   23.607

I wish to use the values of Chr and Pos in each row of dfA to sequentially subset a second data.frame dfB of the form:

Chr Pos Watson  Crick
1   1   5       0
1   2   5       0
1   4   1       0
1   6   1       0
1   7   1       0
1   8   2       0
1   9   2       0
1   12  1       0
1   14  1       0
1   15  2       0
1   22  1       0

dfB has about 4 million rows.

Each time I subset dfB, I'd like to retrieve the values for a region of interest defined by a range around Pos (i.e. within +/- 1000 of the value of Pos in dfA), and add them to a third data.frame dfC that is initially prefilled with zeros.

I have this working by looping through each row of dfA. But due to the 65,000 rows, it takes hours. So my questions are:

  1. Is there a better/more efficient way?

  2. Which part of my code is slowing this down so terribly?

My code:

temp=NULL
width=300 # Region upstream and downstream of centrepoint #
padding=50 # Add some padding area to table #
width1=width+padding
dfC=data.frame(NULL)
dfC[1:((width1*2)+1),"Pos"]=(1:((width1*2)+1)) # Create Pos column #

# Prefill dfC table with zeros #
dfC[1:((width1*2)+1),"Watson"]=0
dfC[1:((width1*2)+1),"Crick"]=0

for (chrom in 1:16) { # LOOP1. Specify which chromosomes to process #

  dfB.1=subset(dfB,Chr==chrom) # Make temp copy of the dataframes for each chromosome #
  dfA.1=subset(dfA, Chr==chrom)

for (i in seq_len(nrow(dfA.1))) { # LOOP2: For each row in dfA (seq_len avoids iterating over 1:0 when a chromosome has no rows):

  temp=subset(dfB.1, Pos>=(dfA.1[i,"Pos"]-width1) & Pos<=(dfA.1[i,"Pos"]+width1)) # Create temp matrix with hits in this region
  temp$Pos=temp$Pos-dfA.1[i,"Pos"]+width1+1
  dfC[temp$Pos,"Watson"]=dfC[temp$Pos,"Watson"]+temp[,"Watson"]
  dfC[temp$Pos,"Crick"]=dfC[temp$Pos,"Crick"]+temp[,"Crick"]

} # End of LOOP2 #
} # End of LOOP1 #

Example output is in the following form - where Pos contains values of 1 to 2000 (representing the region of -1000 to +1000 flanking each central Pos position in dfA), and the Watson/Crick columns contain the sum of the hits for each location.

Pos Watson  Crick
1   15      34
2   35      32
3   11      26
4   19      52
5   10      23
6   32      17
7   21      6
8   15      38
9   17      68
10  28      54
11  27      35
etc
It would be helpful if you included the expected output for the sample code. Also, consider posting your data in the form of dput - Sumedh
Thanks for the help with the editing/formatting. I have added the example output. I'm not familiar with dput, but I am reading the help now. - Matt

1 Answer


I only cleaned up your code, so don't expect a great improvement, but I think this version might run marginally faster.

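width <- 300    # region upstream and downstream of the centre point
padding <- 50   # extra padding around the region
width1 <- width + padding
# Accumulator: one row per offset in the window, prefilled with zeros
dfC <- data.frame(Pos=1:((width1*2)+1), Watson=0, Crick=0)
for (chrom in 1:16) {
    dfB1 <- subset(dfB, Chr == chrom)    # subset dfB once per chromosome
    for (pos in dfA$Pos[dfA$Chr == chrom]) {
        # Rows of dfB1 that fall inside the window around pos
        dfB2 <- dfB1[(dfB1$Pos >= pos - width1) & (dfB1$Pos <= pos + width1), ]
        # Map genomic positions to row indices 1..(2*width1 + 1) in dfC
        rows <- dfB2$Pos - pos + width1 + 1
        dfC$Watson[rows] <- dfC$Watson[rows] + dfB2$Watson
        dfC$Crick[rows] <- dfC$Crick[rows] + dfB2$Crick
    }
}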
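
If that is still too slow, one option is to drop the inner loop altogether. Each pass of the loop above runs a logical comparison over every row of that chromosome's subset of dfB, which is probably where most of the time goes. The following is only a sketch and I have not timed it on data of your size. It assumes Pos holds integer positions, and the function name window_sums is made up for illustration. It sorts each chromosome's rows once, uses findInterval() to locate the first and last row inside every window, and then sums all hits per window offset with rowsum().

```r
# Hypothetical helper (not from the original post): sums Watson/Crick hits
# from dfB at each offset within +/- width1 of every Pos in dfA.
window_sums <- function(dfA, dfB, width1 = 350) {
  n <- 2 * width1 + 1
  totals <- matrix(0, nrow = n, ncol = 2,
                   dimnames = list(NULL, c("Watson", "Crick")))
  for (chrom in intersect(unique(dfA$Chr), unique(dfB$Chr))) {
    b <- dfB[dfB$Chr == chrom, ]
    b <- b[order(b$Pos), ]                 # findInterval() needs sorted Pos
    centres <- dfA$Pos[dfA$Chr == chrom]
    # First and last row of b inside each window (assumes integer positions)
    lo <- findInterval(centres - width1 - 1, b$Pos) + 1
    hi <- findInterval(centres + width1, b$Pos)
    keep <- hi >= lo                       # drop windows that contain no hits
    if (!any(keep)) next
    lo <- lo[keep]; hi <- hi[keep]; centres <- centres[keep]
    len <- hi - lo + 1
    # One entry per (centre, dfB row) pair
    b_idx <- sequence(len) + rep(lo - 1, len)
    rows <- as.integer(b$Pos[b_idx] - rep(centres, len) + width1 + 1)
    # Sum hits per window offset, then add them to the running totals
    sums <- rowsum(cbind(Watson = b$Watson[b_idx], Crick = b$Crick[b_idx]), rows)
    idx <- as.integer(rownames(sums))
    totals[idx, ] <- totals[idx, , drop = FALSE] + sums
  }
  data.frame(Pos = seq_len(n), Watson = totals[, "Watson"],
             Crick = totals[, "Crick"])
}
```

You would call it as dfC <- window_sums(dfA, dfB, width1 = 350). It holds one entry per (centre, hit) pair for a whole chromosome in memory at once, so it trades memory for speed. Also note that if dfB ever contains the same Pos twice on one chromosome, this version adds both rows, whereas the indexed assignment in the loop keeps only one of them.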