
In the chipz decompression library there is an extremely useful function, make-decompressing-stream, which provides an interface (using Gray streams behind the scenes) for transparently decompressing data read from the provided stream. This allows me to write a single function, read-tag (which reads a single "tag" from a stream of structured binary data, much as Common Lisp's read function reads a single Lisp "form" from a stream), that works on both compressed and uncompressed data, e.g.:

;; For uncompressed data:
(read-tag in-stream)
;; For compressed data:
(read-tag (chipz:make-decompressing-stream 'chipz:zlib in-stream))

As far as I can tell, the API of the associated compression library, salza2, doesn't provide an (out-of-the-box) equivalent interface for performing the reverse task. How could I implement such an interface myself? Let's call it make-compressing-stream. It will be used with my own complementary write-tag function, and provide the same benefits as for reading:

;; For uncompressed data:
(write-tag out-stream current-tag)
;; For compressed data:
(write-tag (make-compressing-stream 'salza2:zlib-compressor out-stream)
           current-tag)

In salza2's documentation (linked above), in the overview, it says: "Salza2 provides an interface for creating a compressor object. This object acts as a sink for octets (either individual or vectors of octets), and is a source for octets in a compressed data format. The compressed octet data is provided to a user-defined callback that can write it to a stream, copy it to another vector, etc." For my current purposes, I only require compression in zlib and gzip formats, for which standard compressors are provided.
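Putting that description together with trivial-gray-streams, I imagine the wrapper could look something like this (untested sketch: I'm assuming a compressor accepts a :callback initarg and that compress-octet, compress-octet-vector and finish-compression are the right entry points):

```lisp
(defclass compressing-stream
    (trivial-gray-streams:fundamental-binary-output-stream)
  ((compressor :initarg :compressor :reader compressor)))

(defun make-compressing-stream (compressor-type stream)
  ;; The compressor's callback writes each compressed block to STREAM.
  (make-instance 'compressing-stream
                 :compressor
                 (make-instance compressor-type
                                :callback (salza2:make-stream-output-callback stream))))

(defmethod trivial-gray-streams:stream-write-byte ((s compressing-stream) byte)
  (salza2:compress-octet byte (compressor s))
  byte)

(defmethod trivial-gray-streams:stream-write-sequence
    ((s compressing-stream) sequence start end &key)
  (salza2:compress-octet-vector sequence (compressor s) :start start :end end)
  sequence)

(defmethod close ((s compressing-stream) &key abort)
  ;; Flush any pending compressed data before closing the stream.
  (declare (ignore abort))
  (salza2:finish-compression (compressor s))
  (call-next-method))
```

Is something along these lines viable?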

So here's how I think it could be done: firstly, convert my "tag" object to an octet vector; secondly, compress it using salza2:compress-octet-vector; and thirdly, provide a callback function that writes the compressed data directly to a file. From reading around, I think the first step could be achieved using flexi-streams:with-output-to-sequence - see here - but I'm really not sure about the callback function, despite looking at salza2's source. But here's the thing: a single tag can contain an arbitrary number of arbitrarily nested tags, and the "leaf" tags of this structure can each carry a sizeable payload; in other words, a single tag can be quite a lot of data.
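For the non-chunked case, I imagine the three steps combining roughly like this (untested sketch; write-tag is my own function described above, and I'm assuming with-output-to-sequence hands it a binary stream):

```lisp
(defun write-compressed-tag (out-stream tag)
  ;; Step 1: serialise the tag into an in-memory octet vector.
  (let ((octets (flexi-streams:with-output-to-sequence (s)
                  (write-tag s tag))))
    ;; Steps 2 and 3: compress the vector, with a callback that
    ;; writes the compressed octets straight to OUT-STREAM.
    (salza2:with-compressor (c 'salza2:zlib-compressor
                               :callback (salza2:make-stream-output-callback out-stream))
      (salza2:compress-octet-vector octets c))))
```

But this buffers the whole tag in memory, which brings me to the problem below.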

So the tag->uncompressed-octets->compressed-octets->file conversion would ideally need to be performed in chunks, and this raises a question that I don't know how to answer. Compression formats - AIUI - store a checksum of their payload data (in the trailer, for zlib and gzip). If I compress the data one chunk at a time and append each compressed chunk to an output file, surely there will be a header and checksum for each chunk, rather than a single header and checksum for the entire tag's data, which is what I want? How can I solve this problem? Or is it already handled by salza2?

Thanks for any help, sorry for rambling :)


1 Answer


From what I understand, chipz can't directly decompress multiple independently-compressed chunks from a single file. Suppose we compress two chunks separately and append them to the same file:

(defun bytes (&rest elements)
  (make-array (length elements)
              :element-type '(unsigned-byte 8)
              :initial-contents elements))

(defun compress (chunk &optional mode)
  (with-open-file (output #P"/tmp/compressed"
                          :direction :output
                          :if-exists mode
                          :if-does-not-exist :create
                          :element-type '(unsigned-byte 8))
    (salza2:with-compressor (c 'salza2:gzip-compressor
                               :callback (salza2:make-stream-output-callback output))
      (salza2:compress-octet-vector chunk c))))

(compress (bytes 10 20 30) :supersede)
(compress (bytes 40 50 60) :append)

Now, /tmp/compressed contains two consecutive chunks of compressed data. Calling decompress reads the first chunk only:

(chipz:decompress nil 'chipz:gzip #P"/tmp/compressed")
=> #(10 20 30)

Looking at the source of chipz, the stream is read using an internal buffer, which means the bytes that follow the first chunk have probably already been read, just not decompressed. That explains why, when making two consecutive decompress calls on the same stream, the second one fails with a premature end-of-stream error:

(with-open-file (input #P"/tmp/compressed"
                       :element-type '(unsigned-byte 8))
  (list
   #1=(multiple-value-list (ignore-errors (chipz:decompress nil 'chipz:gzip input)))
   #1#))

=> ((#(10 20 30))
    (NIL #<CHIPZ:PREMATURE-END-OF-STREAM {10155E2163}>))

I don't know how large the data is supposed to be, but if this ever becomes a problem, you might need to change the decompression algorithm so that, when it reaches the done state (see inflate.lisp), enough information is returned to process the remaining bytes as a new chunk. Alternatively, compress into different files and bundle them with an archive format like TAR (see https://github.com/froydnj/archive).
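To come back to your checksum question: as long as every chunk goes through the same compressor inside a single with-compressor, salza2 emits only one header and one checksum, because finish-compression runs once, at the end. A quick sketch, reusing the bytes helper from above:

```lisp
(with-open-file (output #P"/tmp/compressed-single"
                        :direction :output
                        :if-exists :supersede
                        :if-does-not-exist :create
                        :element-type '(unsigned-byte 8))
  (salza2:with-compressor (c 'salza2:gzip-compressor
                             :callback (salza2:make-stream-output-callback output))
    ;; Two chunks, one compressor: a single gzip member comes out.
    (salza2:compress-octet-vector (bytes 10 20 30) c)
    (salza2:compress-octet-vector (bytes 40 50 60) c)))

(chipz:decompress nil 'chipz:gzip #P"/tmp/compressed-single")
;; => #(10 20 30 40 50 60)
```

So chunked compression of a single tag is fine; the problem above only arises when each chunk gets its own compressor.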