I'm building a git clone implementation in Rust. I've gotten to the part where I need to parse the packfile to create the index, and I'm almost done parsing it.

Each object in the packfile consists of a header (which I'm already parsing correctly) followed by the contents, which are zlib-compressed.

Notably, the size stored in the header is the decompressed size, so it doesn't tell me how many compressed bytes to skip to reach the next header.
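For context, here's a sketch of that header format (the `parse_object_header` helper is illustrative, not my actual code): the first byte holds a 3-bit object type and the low 4 bits of the size, and the size continues as a little-endian base-128 varint for as long as each byte's high bit is set.

```rust
// Hypothetical helper that parses one packfile object header:
// 3-bit type in bits 4-6 of the first byte, decompressed size as a
// little-endian base-128 varint starting in the low 4 bits.
fn parse_object_header(buf: &[u8]) -> (u8, u64, usize) {
    let mut byte = buf[0];
    let obj_type = (byte >> 4) & 0x07;   // 3-bit object type
    let mut size = (byte & 0x0F) as u64; // low 4 bits of the size
    let mut shift = 4;
    let mut used = 1;
    while byte & 0x80 != 0 {             // high bit set: more size bytes follow
        byte = buf[used];
        used += 1;
        size |= ((byte & 0x7F) as u64) << shift;
        shift += 7;
    }
    (obj_type, size, used)               // type, decompressed size, header length
}

fn main() {
    // 0x95 = 1_001_0101: continue bit set, type 1 (commit), size nibble 5;
    // the next byte 0x01 contributes 1 << 4 = 16, so the size is 21.
    let (t, size, used) = parse_object_header(&[0x95, 0x01]);
    println!("type {} size {} header bytes {}", t, size, used);
}
```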

Crates.io shows two crates that do zlib decompression and have more than a few downloads:

  • libz-sys: is practically a hello world and has been like that for months
  • flate2: correctly inflates the data with ease:

    print!("Object type {} size {}", obj_type as u8, obj_size);
    
    println!(" data:\n{}",
        String::from_utf8(
            ZlibDecoder::new(data).read_exact(obj_size as usize).unwrap()
        ).unwrap()
    );
    

Here's the problem. After this I need to start reading the next object's header, but ZlibDecoder doesn't give any way to detect how large the input was.

It takes ownership of a reader as its input, rather than a reference.

Because of this, even though I have the object's output size (and indeed all of the object's data), I don't know the input size, so I can't start reading the next object's header.

How do I get the amount of compressed input bytes needed to reach the expected output size? If possible, I'd like to avoid using FFI to call native zlib.

PS: the flate2 docs suggest a helper trait, but I have no idea how, or whether, this would help me.


1 Answer


Normally, you can pass a reference to a Reader / Writer (via ByRefReader or ByRefWriter) to allow adding adapters to the stream without losing control of it. Something like this should work:

#![feature(io,path,env)]

extern crate flate2;

use flate2::CompressionLevel;
use flate2::writer::ZlibEncoder;
use flate2::reader::ZlibDecoder;

use std::env;
use std::old_io::File;
use std::old_io::{ByRefReader,ByRefWriter};
use std::old_path::Path;

fn main() {
    let path = "./data";
    let write = env::var("WRITE").is_ok();

    if write {
        println!("Writing to {}", path);
        let mut f = File::create(&Path::new(path)).unwrap();

        fn write_it<W>(w: &mut W, s: &str) where W: Writer {
            let mut z = ZlibEncoder::new(ByRefWriter::by_ref(w), CompressionLevel::Default);
            z.write_all(s.as_bytes()).unwrap();
        }

        write_it(&mut f, "hello world");
        write_it(&mut f, "goodbye world");
    } else {
        println!("Reading from {}", path);
        let mut f = File::open(&Path::new(path)).unwrap();

        fn read_it<R>(r: &mut R) -> String where R: Reader {
            let mut z = ZlibDecoder::new(ByRefReader::by_ref(r));
            z.read_to_string().unwrap()
        }

        println!("{}", read_it(&mut f));
        println!("{}", read_it(&mut f));
    }
}

This does work for writing - I see the zlib header appear twice in the output file. However, it does not work when reading: it looks like reader::ZlibDecoder consumes the underlying Reader all the way to the end. This could be a bug or an oversight in the flate2 library, but a few minutes of staring at the source hasn't shown anything obvious.

Edit

Here's a terrible hack that "works" though:

fn read_it<R>(r: &mut R) -> String where R: Reader {
    let mut z = ZlibDecoder::new_with_buf(ByRefReader::by_ref(r), Vec::with_capacity(1));
    z.read_to_string().unwrap()
}

println!("{}", read_it(&mut f));
f.seek(-1, std::old_io::SeekStyle::SeekCur).unwrap();
println!("{}", read_it(&mut f));

The problem arises because flate2 is a bit greedy in how it reads from the reader. It always tries to fill its own internal buffer as much as it can, even if some of that data isn't going to be read. This terrible, nasty hack causes it to only ever read a single byte at a time. Thus, you can rewind one byte at the end and start again.

A longer-term solution is probably to plumb an accessor for total_in up from the underlying Stream until it's exposed on ZlibDecoder.