3
votes

Say I have a .txt file like this:

11111111111111Hello and welcome to stackoverflow. stackoverflow will hopefully provide me with answers to answers i do not know. Hello and goodbye.11111111111111

Then I would have an equivalent in binary form (.bin file) created as such:

stream.Write(intBytes, 0, intBytes.Length); // 11111111111111 as 8 raw bytes
stream.Write(junkText, 0, junkText.Length); // Hello and welcome to stackoverflow...
stream.Write(intBytes, 0, intBytes.Length); // 11111111111111 as 8 raw bytes
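
For completeness, here is a minimal sketch of how the two files could be produced. The FileStream setup and everything beyond the three Write calls are my assumptions, not code from the original post:

using System;
using System.IO;
using System.Text;

class WriteFiles
{
    static void Main()
    {
        // Stand-in for the repeated filler text shown above.
        string junk = "Hello and welcome to stackoverflow. stackoverflow will " +
                      "hopefully provide me with answers to answers i do not know. " +
                      "Hello and goodbye.";

        byte[] intBytes = BitConverter.GetBytes(11111111111111L); // 8 raw bytes
        byte[] junkText = Encoding.UTF8.GetBytes(junk);

        // .bin file: raw long bytes, the text, raw long bytes again.
        using (var stream = File.Create("test.bin"))
        {
            stream.Write(intBytes, 0, intBytes.Length);
            stream.Write(junkText, 0, junkText.Length);
            stream.Write(intBytes, 0, intBytes.Length);
        }

        // .txt file: the same digits written as 14 ASCII characters.
        File.WriteAllText("test.txt", "11111111111111" + junk + "11111111111111");
    }
}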

The first example compresses better than the second. If I remove the 11111111111111 they compress to the same size. But having the 11111's means the .txt version compresses better.

byte[] intBytes = BitConverter.GetBytes(11111111111111); // 8 bytes: the literal is a long
byte[] strBytes = UTF8Encoding.UTF8.GetBytes("11111111111111"); // 14 bytes: one per digit character
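
Dumping the bytes shows where the difference comes from (a quick sketch of my own, using the two arrays above):

// Prints something like C7-B1-D4-01-1B-0A-00-00 on a little-endian machine:
// byte values that never occur in the surrounding ASCII text.
Console.WriteLine(BitConverter.ToString(intBytes));

// Prints 31-31-31-31-31-31-31-31-31-31-31-31-31-31:
// fourteen repeats of one symbol ('1') that plain text already uses.
Console.WriteLine(BitConverter.ToString(strBytes));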

This is using the native C++ zlib library.

Before compression the .bin file is smaller, which I expected.

Why is the .txt version smaller after compression? It seems to compress better than the .bin equivalent.

bin file: uncompressed size: 2448 bytes, compressed size: 177 bytes

txt file: uncompressed size: 2460 bytes, compressed size: 167 bytes

1
How is the .txt file stored? UTF-8 or another format? If there is a difference in formats, that explains the difference in size. UTF-8 is a variable-width encoding, which could explain the difference you are seeing. – JugsteR
What does "compress better" mean? Absolute size or percent? Please post the numbers. – usr
I was more interested that the txt file was larger than the bin before compression, but afterwards the txt compressed to a smaller file. The default encoding for txt files on Windows 7 is UTF-8, I think. – Science_Fiction

1 Answer

2
votes

So the bigger file compresses to a smaller result. There are two explanations I can offer:

  1. Compression works when the input has low entropy. Try compressing 180 bytes of random data: the compressed result will be even larger than the best of your test cases. Prepending the binary ones means the compressor has to handle binary data and text at the same time; new byte values are introduced that never occur in the text, which increases the entropy of the file (see the sketch after this list).
  2. Every compression algorithm has weak and strong spots (except for perfect "Kolmogorov" compression). You might be seeing an anomaly caused by some implementation detail; the difference is not big, after all.
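
To make point 1 concrete, here is a self-contained sketch. It uses .NET's DeflateStream rather than the native zlib build from the question, and the filler text is my stand-in, so the byte counts will not match the question's exactly, but the binary variant should consistently compress a little worse:

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

class CompressCompare
{
    // Deflate the buffer and return the compressed length.
    static long Deflate(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
                deflate.Write(data, 0, data.Length);
            return output.Length;
        }
    }

    static void Main()
    {
        // Stand-in for the question's ~2.4 KB of repeated filler text.
        byte[] junk = Encoding.UTF8.GetBytes(string.Concat(
            Enumerable.Repeat("Hello and welcome to stackoverflow. Hello and goodbye. ", 40)));

        byte[] intBytes = BitConverter.GetBytes(11111111111111L);   // 8 raw bytes
        byte[] strBytes = Encoding.UTF8.GetBytes("11111111111111"); // 14 ASCII digits

        byte[] bin = intBytes.Concat(junk).Concat(intBytes).ToArray();
        byte[] txt = strBytes.Concat(junk).Concat(strBytes).ToArray();

        Console.WriteLine("bin: {0} -> {1}", bin.Length, Deflate(bin));
        Console.WriteLine("txt: {0} -> {1}", txt.Length, Deflate(txt));
    }
}

The fourteen '1' characters cost almost nothing with DEFLATE (a single repeated symbol compresses to a tiny back-reference), while the eight raw bytes add literals the Huffman tables otherwise would not need.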