2
votes

I am trying to apply a function by row using data.table with columns as arguments. I am currently using apply as suggested here

However, my data.table is 27 million rows with 7 columns so the apply operation takes a very long time when I run it recursively on many input files, the job takes up all available RAM (32Gb). It's likely that I am copying the data.table multiple times, though I'm not sure about that.

I would like help making this code more memory efficient given that each input file will be ~30 million rows by 7 columns and there are 30 input files to process. I am fairly sure that the lines using apply are slowing down the whole code so alternatives that are more memory efficient or use vectorized functions would probably be better options.

I've had a lot of trouble trying to write a vectorized function that takes in 4 columns as arguments and operates on a row by row basis, using data.table. The apply solution in my example code works but it's very slow. One alternative I tried is:

cols=c("C","T","A","G")
func1<-function(x)x[max1(x)]
datU[,high1a:=func1(cols),by=1:nrow(datU)]

but the first 6 rows of the datU data.table output look like this:

    Cycle   Tab ID  colA    colB    colC    colG    high1   high1a
1   0   45513   -233.781    -84.087 -3.141  3740.916    3740.916    colC
2   0   45513   -103.561    -347.382    2900.866    357.071 2900.866    colC
3   0   45513   153.383 4036.636    353.479 -42.736 4036.636    colC
4   0   45513   -147.941    28.994  4354.994    384.945 4354.994    colC
5   0   45513   -89.719 -504.643    1298.476    131.32  1298.476    colC
6   0   45513   -250.11 -30.862 1877.049    -184.772    1877.049    colC

Here is my code using apply that works (it produced the high1 column above), but is too slow and memory intensive:

#Get input files from top directory, searching through all subdirectories
    file_list <- list.files(pattern = "*.test.txt", recursive=TRUE, full.names=TRUE)

#Make a loop to recursively read files from subdirectories, determine highest and second highest values in specified columns, create new column with those values

    savelist=NULL
    for (i in file_list) {

    datU <- fread(i)
    name=dirname(i)

    #Compute highest and second highest for each row (cols 4,5,6,7) and the difference between highest and second highest values
    maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
    max1 <- maxn(1)
    max2 <- maxn(2)
    colNum=c(4,5,6,7)
    datU[,high1:=apply(datU[,colNum,with=FALSE],1,function(x)x[max1(x)])])
    datU[,high2:=apply(datU[,colNum,with=FALSE],1,function(x)x[max2(x)])]
    datU[,difference:=high1-high2,by=1:nrow(datU)]
    datU[,folder:=name]
    savelist[[i]]<-datU

}

#Create loop to iterate over folders and output data

sigout=NULL
for (i in savelist) {

   # Do some stuff to manipulate data frames, then merge them for output
setkey(i,Cycle,folder)
Sums1<-i[,sum(colA,colB,colC,colD),by=list(Cycle,folder)]
MeanTot<-Sums[,round(mean(V1),3),by=list(Cycle,folder)]
MeanTotsd<-Sums[,round(sd(V1),3),by=list(Cycle,folder)]
Meandiff<-i[,list(meandiff=mean(difference)),by=list(Cycle,folder)]
Meandiffsd<-i[,list(meandiff=sd(difference)),by=list(Cycle,folder)]

df1out<-merge(MeanTot,MeanTotsd,by=list(Cycle,folder))
df2out<-merge(Meandiff,Meandiffsd,by=list(Cycle,folder))
sigout<-merge(df1out,df2out)

#Output values 
write.table(sigout,"Sigout.txt",append=TRUE,quote=FALSE,sep=",",row.names=FALSE,col.names=TRUE)
}

I would love some examples concerning alternative functions to apply that will give me the highest and second highest values for each row for columns 4,5,6,7 which can be identified by index or alternatively by column name.

Thank you!

2

2 Answers

2
votes

You could do something like this:

DF <- read.table(text = "    Cycle   Tab ID  colA    colB    colC    colG    high1   high1a
1   0   45513   -233.781    -84.087 -3.141  3740.916    3740.916    colC
                 2   0   45513   -103.561    -347.382    2900.866    357.071 2900.866    colC
                 3   0   45513   153.383 4036.636    353.479 -42.736 4036.636    colC
                 4   0   45513   -147.941    28.994  4354.994    384.945 4354.994    colC
                 5   0   45513   -89.719 -504.643    1298.476    131.32  1298.476    colC
                 6   0   45513   -250.11 -30.862 1877.049    -184.772    1877.049    colC", header = TRUE)

library(data.table)
setDT(DF)

maxTwo <- function(x) {
  ind <- length(x) - (1:0) #the index is equal for all rows,
                           #so it could be made a function parameter
                           #for better efficiency
  as.list(sort.int(x, partial = ind)[ind]) #partial sorting
}

DF[, paste0("max", 1:2) := maxTwo(unlist(.SD)), 
    by = seq_len(nrow(DF)), .SDcols = 4:7]
DF[, diffMax := max2 - max1]

#   Cycle Tab    ID     colA     colB     colC     colG    high1 high1a    max1     max2  diffMax
#1:     1   0 45513 -233.781  -84.087   -3.141 3740.916 3740.916   colC  -3.141 3740.916 3744.057
#2:     2   0 45513 -103.561 -347.382 2900.866  357.071 2900.866   colC 357.071 2900.866 2543.795
#3:     3   0 45513  153.383 4036.636  353.479  -42.736 4036.636   colC 353.479 4036.636 3683.157
#4:     4   0 45513 -147.941   28.994 4354.994  384.945 4354.994   colC 384.945 4354.994 3970.049
#5:     5   0 45513  -89.719 -504.643 1298.476  131.320 1298.476   colC 131.320 1298.476 1167.156
#6:     6   0 45513 -250.110  -30.862 1877.049 -184.772 1877.049   colC -30.862 1877.049 1907.911

However, you'd still be looping over the rows, which means nrow calls to the function. You could try Rcpp to do the looping in compiled code.

2
votes

Depending on how you want to deal with duplicates, e.g. if you don't have them or want to group them together, you could do:

d = data.table(a = 1:4, b = 4:1, c = c(2,1,1,4))
#   a b c
#1: 1 4 2
#2: 2 3 1
#3: 3 2 1
#4: 4 1 4

high1 = do.call(pmax, d)
#[1] 4 3 3 4
high2 = do.call(pmax, d * (d != high1))
#[1] 2 2 2 1

Otherwise, you could just add some jitter out of the scope of your precision (I chose a large amount to keep it visible):

d.jitter = d + runif(nrow(d) * ncol(d), 0, 1e-4)
#          a        b        c
#1: 1.000044 4.000090 2.000008
#2: 2.000076 3.000029 1.000034
#3: 3.000007 2.000029 1.000036
#4: 4.000001 1.000069 4.000041

high1.j = do.call(pmax, d.jitter)
high2 = do.call(pmax, d * (d.jitter != high1.j))
#[1] 2 2 2 4

Translation to relevant .SD and .SDcols semantics is left as a simple exercise to the reader.