I would like to return a matrix/data.frame in which each row contains the arguments and the content of a file.
However, there may be many files, so I would prefer to load each file lazily, so that a file is only read when its content is actually requested. The function below loads the files eagerly if as.func=FALSE.
It would be perfect if it could load them lazily, but it would also be acceptable if, instead of the content, a function were returned that reads the content.
I can make functions that read the content (see below with as.func=TRUE), but for some reason I cannot put them into the data.frame to return.
load_parallel_results <- function(resdir, as.func = FALSE) {
  ## Find files called .../stdout
  stdoutnames <- list.files(path = resdir, pattern = "stdout", recursive = TRUE)
  ## Find files called .../stderr
  stderrnames <- list.files(path = resdir, pattern = "stderr", recursive = TRUE)
  if(as.func) {
    ## Create functions that read the content on demand
    stdoutcontents <- lapply(stdoutnames, function(x) {
      force(x)
      function() paste(readLines(paste(resdir, x, sep = "/")), collapse = "\n")
    })
    stderrcontents <- lapply(stderrnames, function(x) {
      force(x)
      function() paste(readLines(paste(resdir, x, sep = "/")), collapse = "\n")
    })
  } else {
    ## Read the content now
    stdoutcontents <- lapply(stdoutnames, function(x) {
      paste(readLines(paste(resdir, x, sep = "/")), collapse = "\n")
    })
    stderrcontents <- lapply(stderrnames, function(x) {
      paste(readLines(paste(resdir, x, sep = "/")), collapse = "\n")
    })
  }
  if(length(stdoutnames) == 0) {
    ## Return an empty data frame if no files were found
    return(data.frame())
  }
  ## Make the columns containing the variable values.
  ## The path looks like header1/value1/header2/value2/.../stdout, so the
  ## even path components are the values and the odd ones give the column names.
  m <- matrix(unlist(strsplit(stdoutnames, "/")), nrow = length(stdoutnames), byrow = TRUE)
  mm <- as.data.frame(m[, c(FALSE, TRUE)])
  ## Append the stdout and stderr columns
  mmm <- cbind(mm, unlist(stdoutcontents), unlist(stderrcontents))
  colnames(mmm) <- c(strsplit(stdoutnames[1], "/")[[1]][c(TRUE, FALSE)], "stderr")
  ## Example:
  ## parallel --results my/res/dir --header : 'echo {};seq {myvar1}' ::: myvar1 1 2 ::: myvar2 A B
  ## > load_parallel_results("my/res/dir")
  ##      myvar1 myvar2 stdout      stderr
  ## [1,] "1"    "A"    "1 A\n1"    ""
  ## [2,] "1"    "B"    "1 B\n1"    ""
  ## [3,] "2"    "A"    "2 A\n1\n2" ""
  ## [4,] "2"    "B"    "2 B\n1\n2" ""
  return(mmm)
}
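To illustrate the kind of return value I am after, here is a minimal, stand-alone sketch (with made-up paths and column names) of reader functions sitting in a data.frame column; as far as I understand, I() keeps the list of functions as a list column instead of unlisting it:

## Hypothetical paths, just for illustration
files <- c("my/res/dir/myvar1/1/stdout", "my/res/dir/myvar1/2/stdout")
readers <- lapply(files, function(x) {
  force(x)
  function() paste(readLines(x), collapse = "\n")
})
## I() keeps the list of functions as a list column
d <- data.frame(myvar1 = c("1", "2"), stdout = I(readers))
## Nothing has been read yet; a file is only read when its reader is called:
## d$stdout[[1]]()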
Background
GNU Parallel has a --results option that stores output in a structured way. If there are 1000000 output files it may be hard to manage them. R is good for that, but it would be awfully slow if you had to read all 1000000 files just to get the ones where argument 1 = "Foo" and argument 2 = "Bar".
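The intended workflow (a sketch, assuming the lazy variant above works and using the column names from the example) would be to filter on the argument columns first and only then read the few matching files:

res <- load_parallel_results("my/res/dir", as.func = TRUE)
## Select the rows of interest without touching any file contents
hits <- res[res$myvar1 == "Foo" & res$myvar2 == "Bar", ]
## Only now read the stdout of the matching rows
## lapply(hits$stdout, function(f) f())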
foo <- matrix(list(mean, sd, `+`, list), 2, 2). But I don't think you would ever need this except in a very special case. Why can't you hold your file names in the matrix or data frame and use them when necessary? - lebatsnok
stat() calls etc pp. Maybe intermediating with a queue is easier? I had good luck with Redis and R access via the rredis package. - Dirk Eddelbuettel
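For what it is worth, lebatsnok's list-matrix really does hold callable functions; a quick check:

foo <- matrix(list(mean, sd, `+`, list), 2, 2)
foo[[1, 1]](1:10)   ## extracts mean and calls it: 5.5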