
I am attempting to write a program that splits a large dataset into smaller datasets that are a target size or less when GZIP compressed.

Thus far, the best I have come up with is to track the raw string length of the data I've seen so far and estimate the final size with a guessed GZIP compression ratio. This results in some wildly off estimates, though. Most of the time the estimated size is within 20% of my target, but sometimes I'll get files that are hundreds of percent bigger than my estimate predicts.

Additionally, it seems like the estimation errors are periodic: if I want 10 MB files, I'll end up with mostly 10 MB files, but then lumps at 20, 30, and 40 MB in the file-size distribution.

So is there any way to make on-the-fly educated guesses as to the compressed output size, without actually compressing the assembled stream so far? Is it possible with a different compression format? I don't need it to be perfect, but I do want it to be close.

Pseudocode example (in practice I can do this in Java, Python, or Scala; this is just illustrative):

COMPRESSION_RATIO_GUESS = 20
targetSize = 10 * 1024 * 1024

with open("bigfile.txt", "r") as f:
    so_far = 0
    rows = []
    for line in f:  # iterate lazily; readlines() would load the whole file into memory
        so_far += len(line)
        rows.append(line)
        if so_far / COMPRESSION_RATIO_GUESS > targetSize:
            # start a new file: write the rows accumulated so far, then reset
            rows = []
            so_far = 0
    

1 Answer


As you have already found, such estimations are a terrible way to try to achieve your objective. There are better ways.

We would need to hear more about your application. What is the size of the datasets you are compressing? What sort of target sizes are you trying to get? How close do you want to get to the target size? Where can you split your dataset, and how often, in byte distance, do those occur?

I would recommend a relatively simple approach, which uses zlib's ability to flush out blocks. You would compress some portion of your dataset, and flush the output. Save the length and where that compressed data ended. (You can flush to a byte boundary.) Repeat with another portion. Continue until you go over your target. Then back up to the last time you flushed, and finish that stream with a final block and trailer. Now start a new file with the data you just backed over.
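For example, here is a minimal Python sketch of that idea (Python, since your pseudocode is Python-like). It assumes the dataset arrives as an iterable of byte-string portions, each a chunk you are willing to split on, and uses zlib.compressobj with wbits=31 to get gzip framing. The function name and the portion grouping are illustrative, not a fixed API, and each emitted file can run over the target by the few bytes of the final block and trailer:

import zlib

def split_gzip(portions, target):
    # portions: iterable of byte strings, each a chunk you are willing to split on
    # target: desired maximum compressed size per output file
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)   # wbits=31 -> gzip header and trailer
    out = bytearray()
    for data in portions:
        snapshot = comp.copy()                  # compressor state at the last flush
        mark = len(out)                         # compressed length at the last flush
        out += comp.compress(data)
        out += comp.flush(zlib.Z_SYNC_FLUSH)    # flush to a byte boundary
        if len(out) > target and mark > 0:
            # Went past the target: finish the stream as it stood at the previous
            # flush (an empty final block plus the gzip trailer)...
            yield bytes(out[:mark]) + snapshot.flush()
            # ...and start a new file with the portion we just backed over.
            comp = zlib.compressobj(9, zlib.DEFLATED, 31)
            out = bytearray(comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH))
    if out:
        yield bytes(out) + comp.flush()         # finish the last file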

Depending on your numbers, the amount you compress in each portion can be chosen so that you get close to the target without significantly impacting the compression ratio.
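As a rough illustration, you could feed the sketch above portions of, say, a thousand lines at a time; the grouping size and output file names here are just placeholders:

# Hypothetical usage of the split_gzip sketch above: group the input into
# 1000-line portions and write each finished gzip member to its own file.
def portions_of(f, n=1000):
    batch = []
    for line in f:
        batch.append(line.encode("utf-8"))
        if len(batch) == n:
            yield b"".join(batch)
            batch = []
    if batch:
        yield b"".join(batch)

with open("bigfile.txt", "r") as f:
    for i, member in enumerate(split_gzip(portions_of(f), 10 * 1024 * 1024)):
        with open("part-%04d.gz" % i, "wb") as part:
            part.write(member)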

There are more complex approaches that allow you to get as close to the target as possible, as used by fitblk. fitblk compresses past the target, then decompresses just the portion of that output that fits within the target. It then recompresses just that amount of data, and decompresses and compresses a third time to get within a few bytes of filling the target.