5
votes

I have around 270k data block pairs; each pair consists of one 32KiB block and one 16KiB block.

When I save them to one file I of course get a very large file, but the data compresses easily: after compressing the 5.48GiB file with WinRAR with strong compression, the resulting file is only 37.4MiB.

But I need random access to each individual block, so I can only compress the blocks individually.
For that I used the DeflateStream class provided by .NET, which reduced the file size to 382MiB (which I could live with).
But the speed is not good enough.

A lot of the speed loss is probably due to always creating a new MemoryStream and DeflateStream instance for each block. But it seems they aren't designed to be reused.
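For reference, a minimal sketch of that per-block approach, assuming Deflate via System.IO.Compression (CompressBlock/DecompressBlock are illustrative names, not the actual code):

    using System.IO;
    using System.IO.Compression;

    static class BlockCompression
    {
        // Compresses a single block with Deflate. A new MemoryStream and
        // DeflateStream are created per call, which is the allocation
        // overhead mentioned above.
        public static byte[] CompressBlock(byte[] block)
        {
            using (var output = new MemoryStream())
            {
                using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
                {
                    deflate.Write(block, 0, block.Length);
                }   // disposing the DeflateStream flushes the remaining compressed bytes
                return output.ToArray();
            }
        }

        // The caller knows the original size (32KiB or 16KiB), so it is passed in.
        public static byte[] DecompressBlock(byte[] compressed, int originalLength)
        {
            var result = new byte[originalLength];
            using (var input = new MemoryStream(compressed))
            using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
            {
                int offset = 0, read;
                while (offset < originalLength &&
                       (read = deflate.Read(result, offset, originalLength - offset)) > 0)
                {
                    offset += read;
                }
            }
            return result;
        }
    }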

And I guess (much?) better compression could be achieved by using a "global" dictionary instead of a separate one for each block.

Is there an implementation of a compression algorithm (preferably in C#) which is suited for that task?

The following link contains the percentage with which each byte value occurs, divided into three block types (32KiB blocks only). The first and third block types each occur 37.5% of the time, and the second 25%. Block type percentages

Long file short story: Type1 consists mostly of ones, Type2 mostly of zeros and ones, and Type3 mostly of zeros. Values greater than 128 do not occur (yet).

The 16KiB blocks consist almost entirely of zeros.

3
Actually, in deflate, the dictionary is limited to 32K (a "sliding window" where the oldest byte is removed as each input byte is added). – Cameron
Are you willing to write your own compressor/decompressor? – Cameron

3 Answers

5
votes

If you want to try different compression you can start with RLE, which should be suitable for your data - http://en.wikipedia.org/wiki/Run-length_encoding - it will be blazingly fast even in the simplest implementation. The related http://en.wikipedia.org/wiki/Category:Lossless_compression_algorithms contains more links to other algorithms, whether you want to roll your own or find someone else's implementation.
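As a concrete illustration, a minimal byte-oriented RLE sketch in C#; the (count, value) pair encoding is just one simple scheme chosen for this example:

    using System.Collections.Generic;

    static class Rle
    {
        // Encodes runs as (count, value) pairs; count is 1..255.
        // Long runs of zeros/ones, as in the blocks described above,
        // shrink dramatically under this scheme.
        public static byte[] Encode(byte[] data)
        {
            var output = new List<byte>();
            int i = 0;
            while (i < data.Length)
            {
                byte value = data[i];
                int run = 1;
                while (run < 255 && i + run < data.Length && data[i + run] == value)
                    run++;
                output.Add((byte)run);
                output.Add(value);
                i += run;
            }
            return output.ToArray();
        }

        public static byte[] Decode(byte[] encoded)
        {
            var output = new List<byte>();
            for (int i = 0; i < encoded.Length; i += 2)
            {
                int run = encoded[i];
                byte value = encoded[i + 1];
                for (int j = 0; j < run; j++)
                    output.Add(value);
            }
            return output.ToArray();
        }
    }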

Random comment: "...A lot of the speed loss is probably..." is not a way to solve a performance problem. Measure and see whether it really is.

4
votes

Gzip is known to be "fine", meaning its compression ratio is okay and its speed is good. If you want more compression, other alternatives exist, such as 7z.

If you want more speed, which seems to be your objective, a faster alternative will provide a significant speed advantage at the cost of some compression efficiency. "Significant" here means many times faster, such as 5x-10x. Such algorithms are favored for "in-memory" compression scenarios such as yours, since they make accessing the compressed blocks almost painless.

As an example, Clayton Stangeland just released LZ4 for C#. The source code is available under a BSD license here: https://github.com/stangelandcl/LZ4Sharp

There are some comparison metrics against gzip on the project homepage, such as:

i5 memcpy 1658 MB/s
i5 Lz4 Compression 270 MB/s Decompression 1184 MB/s  
i5 LZ4C# Compression 207 MB/s Decompression 758 MB/s 49%
i5 LZ4C# whole corpus Compression 267 MB/s Decompression 838 MB/s Ratio 47%
i5 gzip whole corpus Compression 48 MB/s Decompression 266 MB/s Ratio 33%
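If you want to reproduce this kind of comparison on your own 32KiB/16KiB blocks rather than on a test corpus, here is a rough sketch of a throughput measurement; the compressor is passed in as a delegate so you can plug in DeflateStream, LZ4Sharp, or anything else (CompressionBenchmark and MeasureThroughput are illustrative names, not part of any library):

    using System;
    using System.Diagnostics;

    static class CompressionBenchmark
    {
        // Runs the given compressor over the sample blocks and reports
        // throughput in MB/s plus the overall compression ratio.
        public static void MeasureThroughput(
            string name, Func<byte[], byte[]> compress, byte[][] blocks)
        {
            long inputBytes = 0, outputBytes = 0;
            var watch = Stopwatch.StartNew();
            foreach (var block in blocks)
            {
                var compressed = compress(block);
                inputBytes += block.Length;
                outputBytes += compressed.Length;
            }
            watch.Stop();

            double mbPerSecond = (inputBytes / (1024.0 * 1024.0)) / watch.Elapsed.TotalSeconds;
            double ratio = 100.0 * outputBytes / inputBytes;
            Console.WriteLine("{0}: {1:F0} MB/s, ratio {2:F0}%", name, mbPerSecond, ratio);
        }
    }

Passing in a per-block Deflate compressor and then an LZ4 one would give you numbers directly comparable to the table above, measured on your own data.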

Hope this helps.

2
votes

You can't have random access into a Deflate stream, no matter how hard you try (unless you forfeit the LZ77 part, but that's mostly what's responsible for making your compression ratio so high right now -- and even then, there are tricky issues to work around). This is because one part of the compressed data is allowed to refer to a previous part up to 32K bytes back, which may in turn refer to another part, and so on; you end up having to decode the stream from the beginning to get the data you want, even if you know exactly where it is in the compressed stream (which, currently, you don't).

But what you could do is compress many (but not all) blocks together using one stream. Then you'd get fairly good speed and compression, and you wouldn't have to decompress all the blocks to get at the one you wanted; just the particular chunk that your block happens to be in. You'd need an additional index that tracks where each compressed chunk of blocks starts in the file, but that's fairly low overhead. Think of it as a compromise between compressing everything together (which is great for compression but sucks for random access) and compressing each block individually (which is great for random access but sucks for compression and speed).
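A sketch of that compromise, assuming fixed 32KiB blocks, Deflate for the chunk compression, and an arbitrary 64 blocks per chunk (the class and method names are illustrative, not from any existing library):

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;

    class ChunkedBlockFile
    {
        const int BlockSize = 32 * 1024;   // fixed block size assumed here
        const int BlocksPerChunk = 64;     // bigger chunks = better ratio, slower random access

        // Compresses blocks in groups of BlocksPerChunk and returns the file
        // offset at which each compressed chunk starts. Persist that index
        // (e.g. in a small side file) so chunks can be located later.
        public static List<long> Write(string path, IEnumerable<byte[]> blocks)
        {
            var chunkOffsets = new List<long>();
            using (var file = File.Create(path))
            {
                var buffer = new List<byte[]>();
                foreach (var block in blocks)
                {
                    buffer.Add(block);
                    if (buffer.Count == BlocksPerChunk)
                        FlushChunk(file, buffer, chunkOffsets);
                }
                if (buffer.Count > 0)
                    FlushChunk(file, buffer, chunkOffsets);
            }
            return chunkOffsets;
        }

        static void FlushChunk(FileStream file, List<byte[]> buffer, List<long> chunkOffsets)
        {
            chunkOffsets.Add(file.Position);
            // Each chunk is an independent Deflate stream, so it can be
            // decompressed without touching the chunks before it.
            using (var deflate = new DeflateStream(file, CompressionMode.Compress, leaveOpen: true))
            {
                foreach (var block in buffer)
                    deflate.Write(block, 0, block.Length);
            }
            buffer.Clear();
        }

        // Reads one block: seek to its chunk, decompress, slice out the block.
        public static byte[] ReadBlock(string path, List<long> chunkOffsets, int blockIndex)
        {
            int chunkIndex = blockIndex / BlocksPerChunk;
            int blockInChunk = blockIndex % BlocksPerChunk;

            using (var file = File.OpenRead(path))
            {
                file.Seek(chunkOffsets[chunkIndex], SeekOrigin.Begin);
                using (var deflate = new DeflateStream(file, CompressionMode.Decompress))
                {
                    var block = new byte[BlockSize];
                    var skip = new byte[BlockSize];
                    // Skip the blocks that precede the wanted one within the chunk.
                    for (int i = 0; i < blockInChunk; i++)
                        ReadExactly(deflate, skip);
                    ReadExactly(deflate, block);
                    return block;
                }
            }
        }

        static void ReadExactly(Stream stream, byte[] buffer)
        {
            int offset = 0, read;
            while (offset < buffer.Length &&
                   (read = stream.Read(buffer, offset, buffer.Length - offset)) > 0)
            {
                offset += read;
            }
        }
    }

Reading one block then costs decompressing at most BlocksPerChunk blocks (2MiB in this sketch), so the chunk size is the knob that trades compression ratio against random-access latency.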