6
votes

I am trying to process a very large dataset. Basically, I want to loop through all files in a folder and run the function fromJSON on each one, skipping over any file that produces an error. I have built a function using tryCatch; however, it only works when I use lapply and not parLapply.

Here is my code for my exception handling function:

readJson <- function (file) {
 require(jsonlite)
 dat <- tryCatch(
        {
         fromJSON(file, flatten=TRUE)      
        },
         error = function(cond) {
                 message(cond)
                 return(NA)
        },
         warning = function(cond) {
                  message(cond)
                  return(NULL)
                  }
   )
  return(dat)   
}

Then I call parLapply on a character vector, files, which contains the full paths to the JSON files:

dat <- parLapply(cl, files, readJson)

That produces an error when it reaches a file that doesn't end properly, and the list 'dat' is never created. The problematic file is not skipped, which is exactly what the readJson function was supposed to take care of.

With regular lapply, however, it works perfectly fine. The error messages are printed, but the list is still created, with the erroneous files skipped.

Any ideas on how I could use exception handling with parLapply so that it skips over the problematic files and still generates the list?


2 Answers

3
votes

In your error handler function, cond is an error condition. message(cond) signals this condition again, and this time it is caught on the workers and transmitted as an error to the master. Either remove the message calls or replace them with something like message(conditionMessage(cond)). You won't see anything on the master, though, so removing them is probably best.
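To make the second option concrete, here is a minimal sketch. It assumes jsonlite is installed wherever the workers run, and the two temporary files are hypothetical stand-ins for the real 'files' vector. The handlers print only the text of the condition instead of re-signalling the condition object, so the worker returns NA (or NULL for a warning) rather than an error.

```r
library(parallel)

# Same structure as the question's readJson, but the handlers pass only the
# condition's text to message() instead of the condition object itself.
readJson <- function(file) {
  tryCatch(
    jsonlite::fromJSON(file, flatten = TRUE),
    error = function(cond) {
      message(conditionMessage(cond))  # printed on the worker, not the master
      NA
    },
    warning = function(cond) {
      message(conditionMessage(cond))
      NULL
    }
  )
}

# Hypothetical test data: one valid file and one truncated file.
good <- tempfile(fileext = ".json")
bad <- tempfile(fileext = ".json")
writeLines('{"a": 1, "b": [1, 2, 3]}', good)
writeLines('{"a": 1, "b": [1, 2', bad)
files <- c(good, bad)

cl <- makeCluster(2)
dat <- parLapply(cl, files, readJson)
stopCluster(cl)

# Drop the NA placeholders left by files that failed to parse.
ok <- !vapply(dat, identical, logical(1), NA)
dat <- dat[ok]
```

Filtering on identical(x, NA) is just one way to drop the failures; if a warning handler fires, the NULL it returns stays in the list and would need its own check.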

0
votes

You could do something like this (shown with a different, reproducible example):

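test1 <- function(i) {
  dat <- NA  # returned unchanged if the block below fails
  # try() catches the error on the worker, so nothing is re-signalled to the master
  try({
    if (runif(1) < 0.8) {
      dat <- rnorm(i)
    } else {
      stop("Error!")  # simulated failure in roughly 20% of the calls
    }
  })
  return(dat)
}
cl <- parallel::makeCluster(3)
dat <- parallel::parLapply(cl, 1:100, test1)
parallel::stopCluster(cl)  # release the workers when done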

See this related question for other solutions. I think using foreach with .errorhandling = "pass" would be another good solution.
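A rough sketch of the foreach variant, assuming the foreach and doParallel packages are installed. The loop body is a made-up stand-in for the real per-file work. With .errorhandling = "pass", the error object is placed in the result list instead of aborting the loop, so failures can be filtered out afterwards.

```r
library(foreach)
library(doParallel)

cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# Hypothetical task: every third iteration fails on purpose.
dat <- foreach(i = 1:6, .errorhandling = "pass") %dopar% {
  if (i %% 3 == 0) stop("Error!")
  rnorm(i)
}

parallel::stopCluster(cl)

# Elements that are error objects mark the failed iterations.
failed <- vapply(dat, inherits, logical(1), what = "error")
results <- dat[!failed]
```

If the failed items are not needed at all, .errorhandling = "remove" drops them from the result directly.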