I am trying to apply a function row-by-row to a data.table, using several columns as arguments. I am currently using apply as suggested here.
However, my data.table has 27 million rows and 7 columns, so the apply operation takes a very long time, and when I run it over many input files the job consumes all available RAM (32 GB). I suspect I am copying the data.table multiple times, though I'm not sure about that.
I would like help making this code more memory efficient, given that each input file will be ~30 million rows by 7 columns and there are 30 input files to process. I am fairly sure the lines using apply are the bottleneck, so alternatives that are vectorized or more memory efficient would probably be better options.
I've had a lot of trouble writing a vectorized function that takes 4 columns as arguments and operates row by row on a data.table. The apply solution in my example code works, but it's very slow. One alternative I tried is:
cols=c("C","T","A","G")
func1<-function(x)x[max1(x)]
datU[,high1a:=func1(cols),by=1:nrow(datU)]
but the first 6 rows of the datU data.table output look like this:
Cycle Tab ID colA colB colC colG high1 high1a
1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC
2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC
3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC
4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC
5 0 45513 -89.719 -504.643 1298.476 131.32 1298.476 colC
6 0 45513 -250.11 -30.862 1877.049 -184.772 1877.049 colC
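For the row-wise maxima specifically, one fully vectorized sketch I have been experimenting with uses base R's max.col on a matrix of the four columns (the sample data below just reproduces a few of the rows shown above for illustration):

```r
library(data.table)

# Illustrative sample shaped like the table above
datU <- data.table(colA = c(-233.781, -103.561, 153.383),
                   colB = c(-84.087, -347.382, 4036.636),
                   colC = c(-3.141, 2900.866, 353.479),
                   colG = c(3740.916, 357.071, -42.736))
cols <- c("colA", "colB", "colC", "colG")

m    <- as.matrix(datU[, cols, with = FALSE])
rows <- seq_len(nrow(m))
i1   <- max.col(m, ties.method = "first")   # column index of each row's max
datU[, high1 := m[cbind(rows, i1)]]
m[cbind(rows, i1)] <- -Inf                  # knock out the max in each row...
datU[, high2 := m[cbind(rows, max.col(m, ties.method = "first"))]]  # ...and repeat
datU[, difference := high1 - high2]
```

This does one pass over a matrix instead of a per-row function call, so it should scale much better, though I'm not sure it's the most memory-frugal option since as.matrix makes one copy of the four columns.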
Here is my code using apply that works (it produced the high1 column above), but is too slow and memory intensive:
#Get input files from top directory, searching through all subdirectories
file_list <- list.files(pattern = "*.test.txt", recursive=TRUE, full.names=TRUE)
#Make a loop to recursively read files from subdirectories, determine highest and second highest values in specified columns, create new column with those values
savelist <- list()
for (i in file_list) {
datU <- fread(i)
name=dirname(i)
#Compute highest and second highest for each row (cols 4,5,6,7) and the difference between highest and second highest values
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
max1 <- maxn(1)
max2 <- maxn(2)
colNum=c(4,5,6,7)
datU[,high1:=apply(datU[,colNum,with=FALSE],1,function(x)x[max1(x)])]
datU[,high2:=apply(datU[,colNum,with=FALSE],1,function(x)x[max2(x)])]
datU[,difference:=high1-high2]
datU[,folder:=name]
savelist[[i]]<-datU
}
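On the memory side, a sketch of how I think the reading loop could be slimmed down: fread's select argument reads only the needed columns from disk, and pre-allocating the list avoids growing it each iteration (the column names here are assumptions based on the sample above):

```r
library(data.table)

# Sketch: read only the columns actually used and pre-allocate the
# result list, so each iteration allocates once instead of growing objects
file_list <- list.files(pattern = "*.test.txt", recursive = TRUE, full.names = TRUE)
savelist  <- vector("list", length(file_list))
for (k in seq_along(file_list)) {
  datU <- fread(file_list[k], select = c("Cycle", "colA", "colB", "colC", "colG"))
  datU[, folder := dirname(file_list[k])]   # := adds the column by reference
  savelist[[k]] <- datU
}
```

I haven't measured whether this alone is enough to stay under 32 GB, but it should at least avoid holding columns that are never used.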
#Create loop to iterate over folders and output data
sigout=NULL
for (i in savelist) {
# Do some stuff to manipulate data frames, then merge them for output
setkey(i,Cycle,folder)
Sums<-i[,sum(colA,colB,colC,colG),by=list(Cycle,folder)]
MeanTot<-Sums[,round(mean(V1),3),by=list(Cycle,folder)]
MeanTotsd<-Sums[,round(sd(V1),3),by=list(Cycle,folder)]
Meandiff<-i[,list(meandiff=mean(difference)),by=list(Cycle,folder)]
Meandiffsd<-i[,list(meandiffsd=sd(difference)),by=list(Cycle,folder)]
df1out<-merge(MeanTot,MeanTotsd,by=c("Cycle","folder"))
df2out<-merge(Meandiff,Meandiffsd,by=c("Cycle","folder"))
sigout<-merge(df1out,df2out)
#Output values
write.table(sigout,"Sigout.txt",append=TRUE,quote=FALSE,sep=",",row.names=FALSE,col.names=!file.exists("Sigout.txt"))
}
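As a side note on this summary loop, I suspect the separate passes plus merges could be collapsed: data.table can compute several summary columns in one grouped call. A sketch with made-up data (column names assumed from above):

```r
library(data.table)

# Illustrative data: one grouped call produces both summary columns
# at once, instead of two passes plus a merge
i <- data.table(Cycle = c(1, 1, 2, 2), folder = "f1",
                difference = c(10, 20, 30, 50))
stats <- i[, .(meandiff   = round(mean(difference), 3),
               meandiffsd = round(sd(difference), 3)),
           by = .(Cycle, folder)]
```

This returns one row per (Cycle, folder) group with both columns filled in, so the intermediate merge objects disappear.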
I would love some examples of alternatives to apply that return the highest and second highest values in each row for columns 4, 5, 6 and 7, identified either by column index or by column name.
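One other direction I have been considering, in case it helps frame answers: melting the four columns to long form and taking the top two values per row id with a grouped sort (again with a small illustrative sample):

```r
library(data.table)

# Illustrative data shaped like the sample above
datU <- data.table(colA = c(-233.781, -103.561),
                   colB = c(-84.087, -347.382),
                   colC = c(-3.141, 2900.866),
                   colG = c(3740.916, 357.071))
cols <- c("colA", "colB", "colC", "colG")

datU[, r := .I]                                   # row id
long <- melt(datU, id.vars = "r", measure.vars = cols)
setorder(long, r, -value)                         # sort within each row id
top2 <- long[, .(high1 = value[1], high2 = value[2]), by = r]
datU[top2, on = "r", `:=`(high1 = i.high1, high2 = i.high2)]  # join back by reference
datU[, r := NULL]
```

I don't know whether the long intermediate table (4x the rows) makes this a bad idea at 30 million rows, which is partly why I'm asking.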
Thank you!