6
votes

Is there a reason I can't read an RDS file from within a zip file directly, without having to unzip it to a temp file on disk first?

Let's say this is the zip file:

saveRDS(cars, "cars.rds")
saveRDS(iris, "iris.rds")
write.csv(iris, "iris.csv")
zip("datasets.zip", c("cars.rds", "iris.rds", "iris.csv"))
file.remove("cars.rds", "iris.rds", "iris.csv")

For the csv file, I could read it directly like this:

iris2 <- read.csv(unz("datasets.zip", "iris.csv"))

However, I don't understand why I can't use unz() directly with readRDS():

iris3 <- readRDS(unz("datasets.zip", "iris.rds"))

This gives me the error:

Error: unknown input format

I'd also like to understand why this happens. I'm aware that I could do the following, as in this question:

path <- unzip("datasets.zip", "iris.rds")
iris4 <- readRDS(path)
file.remove(path)

This doesn't seem as efficient, though, and I need to do it frequently for a really large number of files, so I/O inefficiencies matter. Is there any workaround to read the rds file without extracting it to disk?


2 Answers

13
votes

This was a little tricky to track down until I read the body of readRDS(). What it seems you need to do is

  1. Open a connection to the .zip archive and the file inside it with unz()
  2. Apply GZIP decompression to this connection using gzcon()
  3. And finally pass this decompressed connection to readRDS().

Here's an example to illustrate, using the following serialised matrix mat stored inside a zip archive matrix.zip:

mat <- matrix(1:9, ncol = 3)
saveRDS(mat, "matrix.rds")
zip("matrix.zip", "matrix.rds")

Open a connection to matrix.zip

con <- unz("matrix.zip", filename = "matrix.rds")

Now, using gzcon(), apply GZIP decompression to this connection

con2 <- gzcon(con)

Finally, read from the connection

mat2 <- readRDS(con2)

In full we have

con <- unz("matrix.zip", filename = "matrix.rds")
con2 <- gzcon(con)
mat2 <- readRDS(con2)
close(con2)

This gives

> con <- unz("matrix.zip", filename = "matrix.rds")
> con2 <- gzcon(con)
> mat2 <- readRDS(con2)
> close(con2)
> mat2
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> all.equal(mat, mat2)
[1] TRUE
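
Since the question mentions doing this for a large number of files, the three steps can be wrapped in a small function. This is only a sketch: read_rds_from_zip() is a hypothetical name, not something base R provides, and it assumes each .rds file was written with saveRDS() and its default gzip compression, and that a zip utility is available for zip().

```r
# Hypothetical helper: read one .rds member of a zip archive without
# extracting it to disk first.
read_rds_from_zip <- function(zipfile, member) {
  # unz() delivers the member's bytes as stored (still gzip-compressed);
  # gzcon() adds the gzip decompression that readRDS() expects.
  con <- gzcon(unz(zipfile, filename = member))
  on.exit(close(con), add = TRUE)  # closing the wrapper also closes unz()
  readRDS(con)
}

# Build an archive like the one in the question
saveRDS(cars, "cars.rds")
saveRDS(iris, "iris.rds")
zip("datasets.zip", c("cars.rds", "iris.rds"))
file.remove("cars.rds", "iris.rds")

# Read several members in one go
members <- c("cars.rds", "iris.rds")
datasets <- lapply(members, read_rds_from_zip, zipfile = "datasets.zip")
names(datasets) <- members
```

Using on.exit() means the connection is closed even if readRDS() fails part-way through, which matters when looping over many archives.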

Why?

Why you have to go through this convoluted extra step is (I think) described in ?readRDS:

Compression is handled by the connection opened when file is a file name, so is only possible when file is a connection if handled by the connection. So e.g. url connections will need to be wrapped in a call to gzcon.

And if you look at the internals of readRDS() we see:

> readRDS
function (file, refhook = NULL) 
{
    if (is.character(file)) {
        con <- gzfile(file, "rb")
        on.exit(close(con))
    }
    else if (inherits(file, "connection")) 
        con <- file
    else stop("bad 'file' argument")
    .Internal(unserializeFromConn(con, refhook))
}
<bytecode: 0x2841998>
<environment: namespace:base>

If file is a character string giving the file name, readRDS() opens it with gzfile(), which creates a connection to the .rds file that decompresses as it reads. Notice that if you pass a connection as file, as you want to do, at no point does R add decompression to that connection: file is just assigned to con and then passed to the internal function unserializeFromConn. Hence wrapping gzcon() around the connection created by unz() works.

Basically, when unserializeFromConn reads from a connection it expects the data to be already decompressed, but that decompression only happens automagically when you pass readRDS() a filename, not a connection.
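
One way to check this explanation is to look at the first bytes that unz() actually delivers. unz() only undoes the zip-level storage, so if the .rds file was written with the default compress = TRUE, the stream should still begin with the gzip signature 0x1f 0x8b. The sketch below assumes a zip utility is available and recreates matrix.zip from the example above.

```r
mat <- matrix(1:9, ncol = 3)
saveRDS(mat, "matrix.rds")        # compress = TRUE is the default
zip("matrix.zip", "matrix.rds")

# Read the first two bytes of the member exactly as unz() delivers them
con <- unz("matrix.zip", filename = "matrix.rds", open = "rb")
magic <- readBin(con, what = "raw", n = 2)
close(con)

# Compare against the gzip signature 0x1f 0x8b
is_gzip <- identical(magic, as.raw(c(0x1f, 0x8b)))
is_gzip
```

If is_gzip is TRUE, the bytes handed to readRDS() are still gzip-compressed, which is why the extra gzcon() layer is needed.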

1
votes

The signature for saveRDS() is

saveRDS(object, file = "", ascii = FALSE, version = NULL,
    compress = TRUE, refhook = NULL)

but, distressingly, there is nothing about compression in the signature for readRDS(). Yet, when you read the documentation for readRDS() you get this little gem

## or examine the object via a connection, which will be opened as needed.
con <- gzfile("women.rds")
readRDS(con)
close(con)

and also

## Less convenient ways to restore the object
## which demonstrate compatibility with unserialize()
con <- gzfile("women.rds", "rb")
identical(unserialize(con), women)
close(con)
con <- gzfile("women.rds", "rb")
wm <- readBin(con, "raw", n = 1e4) # size is a guess
close(con)
identical(unserialize(wm), women)

The other thing to consider is what your gains are going to be using compression on an RDS object at all. Consider

X <- matrix(rnorm(1e7), ncol=10)
saveRDS(X, file = "X.rds")
system("cp X.rds XZ.rds")
system("gzip XZ.rds")
uncomp <- file.info("X.rds")$size
comp <- file.info("XZ.rds.gz")$size
savings <- (1 - comp/uncomp)
# [1] -0.00030541

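so, for this example, gzipping the RDS file costs us space. Bear in mind that saveRDS() uses compress = TRUE by default, so X.rds was already gzip-compressed before gzip was run on it, and random doubles compress poorly in any case.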
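
A more direct way to measure what compression buys is to write the same object with compress = FALSE and compress = TRUE and compare the file sizes. The sketch below does this for two inputs; rds_size() and savings() are hypothetical helper names introduced here, and the two test objects are arbitrary choices meant to contrast noisy and repetitive data.

```r
x_random  <- matrix(rnorm(1e6), ncol = 10)  # noisy doubles
x_regular <- rep(1:10, times = 1e5)         # highly repetitive integers

# Size in bytes of the .rds file written for `object`
rds_size <- function(object, compress) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(object, path, compress = compress)
  file.info(path)$size
}

# Fraction of space saved by saveRDS()'s default gzip compression
savings <- function(object) {
  1 - rds_size(object, compress = TRUE) / rds_size(object, compress = FALSE)
}

savings(x_random)
savings(x_regular)
```

The gain depends heavily on the data, so it is worth running this on a representative object before deciding whether to compress.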