I am attempting to write a program that splits a large dataset into smaller datasets that are at or below a target size when GZIP compressed.
So far, the best approach I have come up with is to track the raw (uncompressed) length of the data I've seen so far and estimate the final size by dividing by a guessed GZIP compression ratio. This produces some wildly off estimates, though. Most of the time the estimated size is within 20% of my target, but sometimes I get files that are hundreds of percent bigger than my estimate predicts.
Additionally, the estimation errors seem to be periodic: if I want 10 MB files, I end up with mostly 10 MB files, but then lumps at 20, 30, and 40 MB in the file-size distribution.
So is there any way to make on-the-fly educated guesses about the compressed output size, without actually compressing the assembled stream so far? Is it possible with a different compression format? It doesn't need to be perfect, but I do want it to be close.
Pseudocode example (in practice I can do this in Java, Python, or Scala; this is just illustrative):
COMPRESSION_RATIO_GUESS = 20
TARGET_SIZE = 10 * 1024 * 1024  # 10 MB target per compressed output file

with open("bigfile.txt", "r") as f:
    so_far = 0
    for line in f:  # iterate lazily instead of readlines()
        so_far += len(line)
        if so_far / COMPRESSION_RATIO_GUESS > TARGET_SIZE:
            # start new file, write rows accumulated so far
            so_far = 0
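For contrast, the exact-but-expensive check I'm trying to avoid would look roughly like this (a sketch using Python's gzip module just for illustration; it re-compresses everything accumulated so far on every line):

import gzip

TARGET_SIZE = 10 * 1024 * 1024
buffered_lines = []

with open("bigfile.txt", "rb") as f:
    for line in f:
        buffered_lines.append(line)
        # Exact compressed size of everything buffered so far, but it
        # re-compresses the whole buffer on each line (quadratic work overall).
        compressed_size = len(gzip.compress(b"".join(buffered_lines)))
        if compressed_size > TARGET_SIZE:
            # start new file, write rows so far
            buffered_lines = []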