I have a complicated list object, the output of a modelling function (asreml). The object contains all sorts of data types, including functions and formulas, which have environments attached. I don't want to save the environments to RDS, because they are quite big and I save a lot of models.

I came across the parameter refhook= in the serialize and saveRDS functions. The documentation says:

The refhook functions can be used to customize handling of non-system reference objects (all external pointers and weak references, and all environments other than namespace and package environments and .GlobalEnv). The hook function for serialize should return a character vector for references it wants to handle; otherwise it should return NULL.

Given this example object:

e <- new.env()
e$a = rnorm(10)
l <- list(a = e, b = 42)

The refhook function does have an effect: the output gets smaller when I supply a function that returns a character string, which suggests that the environment is not saved:

length(serialize(l, connection = NULL))
[1] 338

s <- serialize(l, 
  connection = NULL, 
  refhook = function(x) "")
length(s)
[1] 109

However, I cannot read in the resulting object:

unserialize(s)

Error in unserialize(s) : 
  no restore method available

I also tried returning a raw vector, suspecting that refhook might be expected to provide an alternative serialized output, but that doesn't work either:

s2 <- serialize(l,
  connection = NULL, 
  refhook = function(x) 
    serialize("env", connection = NULL))

Error in serialize(l, con = NULL, refhook = function(x) serialize("env",  : 
  assertion 'TYPEOF(t) == STRSXP && LENGTH(t) > 0' failed: file 'serialize.c', line 982

How do I use refhook=? What character output is expected from this function?

1 Answer

Ah, I figured it out myself. The error "no restore method available" means that no refhook was supplied to unserialize. You need both: a refhook for serialize and one for unserialize.
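As a minimal sketch of that pairing, here is the question's example with a hook on both sides. Restoring every reference as emptyenv() is an assumption made purely for illustration: it discards the environment's contents, which is what the question asks for, but any other replacement object would work as well.

```r
# Example object from the question: a list holding an environment.
e <- new.env()
e$a <- rnorm(10)
l <- list(a = e, b = 42)

# Serialize side: return a placeholder string for every reference object,
# so the environment's contents are not written out.
s <- serialize(l, connection = NULL, refhook = function(x) "")

# Unserialize side: receives the placeholder and returns a replacement.
# emptyenv() is an illustrative choice, not a requirement.
restored <- unserialize(s, refhook = function(chr) emptyenv())

restored$b
```

If the object contains functions whose bodies call other functions, an empty environment will probably not be a usable replacement, since lookups from it find nothing; globalenv() or a fresh new.env() may be a better placeholder in that case.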

The refhook of serialize is free to return any string it likes. The only thing that needs to understand the result is the refhook of unserialize.
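To illustrate that freedom, the sketch below encodes a two-field key as a character vector of length two. It assumes a multi-element vector is accepted, which matches the documentation's wording ("a character vector") but is not confirmed elsewhere in this thread. The names registry, ser_key and unser_key are hypothetical.

```r
# Hypothetical registry of named environments (names are illustrative).
registry <- list(train = new.env(), test = new.env())
registry$train$x <- 1:3
registry$test$x <- 4:6

# Serialize hook: encode a source tag plus the registry name.
ser_key <- function(e) {
  for (nm in names(registry)) {
    if (identical(e, registry[[nm]])) return(c("registry", nm))
  }
  NULL  # not one of ours: let serialize() handle it normally
}

# Unserialize hook: decode the same two-field key.
unser_key <- function(key) {
  stopifnot(length(key) == 2, key[1] == "registry")
  registry[[key[2]]]
}

obj <- list(env = registry$test, n = 1)
bytes <- serialize(obj, connection = NULL, refhook = ser_key)
back <- unserialize(bytes, refhook = unser_key)
identical(back$env, registry$test)
```

The key format is a private contract between the two hooks, so it can carry whatever the restore side needs (a name, a version, a file path) as long as both agree on it.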

Example: Serialize and restore lists which contain environments stored centrally

Generate a repository of environments. Let's pretend that these come from an external source, so their contents don't need to be serialized. To restore them, the external data source just needs to be reread.

repo <- list()
for(i in 1:10){
  repo[[i]] <- new.env()
  repo[[i]]$a <- rnorm(1e6)
}

Each environment holds about 8 MB of data. We don't want all of this in our serialized output, because it is already stored permanently in repo.

object.size(repo[[1]]$a)

This is the list we want to serialize. It contains the second environment from the repository and a numeric value b. We want to store b as usual, but for the environment we only want to record that it is environment 2 of the repository, not its contents, because the repository already has them.

l <- list(a = repo[[2]], b = 42)

This is the refhook for serialize. It looks up the environment in the repository and returns only its index, as a string.

ser <- function(e){
  for(i in seq_along(repo)){
    if(identical(e, repo[[i]])){
      message("Identified environment #",i)
      return(as.character(i)) # Just save the index
    }
  }
  message("Environment not found in the repository")
  return(NULL)
}

The corresponding refhook for unserialize takes the index and loads the corresponding environment from repo:

unser <- function(s){
  i <- as.numeric(s)
  return(repo[[i]])
}

This saves a lot of space in the serialized output:

  • Without custom refhook: also contains the environment

    object.size(serialize(l, con = NULL))
    ## 8000040 bytes
    
  • With custom refhook: Only l$b and the environment index are saved

    s <- serialize(l, con = NULL, refhook = ser)
    object.size(s)
    ## 168 bytes
    

The environment is fetched from the repository when unserialising:

u <- unserialize(s, refhook = unser)
u
## $a
## <environment: 0x000000001c91a118>
## 
## $b
## [1] 42
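Since the question is about saving to RDS, the same pair of hooks can be passed to saveRDS and readRDS, which both take a refhook argument. The following is a sketch with a smaller stand-in repository. The file path comes from tempfile() and the sizes are reduced for illustration; neither is from the original example.

```r
# Small stand-in repository (sizes reduced for illustration).
repo <- list()
for (i in 1:3) {
  repo[[i]] <- new.env()
  repo[[i]]$a <- rnorm(1000)
}
l <- list(a = repo[[2]], b = 42)

# Same hooks as above, without the messages.
ser <- function(e) {
  for (i in seq_along(repo)) {
    if (identical(e, repo[[i]])) return(as.character(i))
  }
  NULL
}
unser <- function(s) repo[[as.integer(s)]]

path <- tempfile(fileext = ".rds")  # illustrative file location
saveRDS(l, path, refhook = ser)
u <- readRDS(path, refhook = unser)

identical(u$a, repo[[2]])  # checks that u$a is the repository's environment
file.size(path)            # size on disk, without the contents of a
```

Note that unser relies on repo being rebuilt in the same order before reading. If the repository can change between sessions, a more stable key than the position (for example a name) is likely the safer choice.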