
I'm implementing a tool that parses a huge set of files, 248 GB in total, compressed in bz2 format. The average compression factor is 0.04, so decompressing them beforehand to over 6 terabytes is out of the question.

Each line of the content files is a complete JSON record, so I'm reading the files with the bz2 module's open() and then a for line in bz2file loop, and it works nicely. The problem is that I have no idea how to show any measure of progress, because I don't know how many compressed bytes I've read nor how many records there are in each file. The files are just huge; some are up to 24 GB.

How would you approach this?


1 Answer


Naive method

You could use tqdm like so (note that you need bz2.open in text mode, "rt", to get decompressed lines; the plain built-in open would hand you raw compressed bytes):

import bz2
from tqdm import tqdm

with bz2.open("hugefile.bz2", "rt") as bz2file:
    for line in tqdm(bz2file, desc="hugefile"):
        ...

This way you will know how many lines you've processed and how long it took. If you want a percentage of where you are in the process, though, you'll need to know beforehand how many lines there are in the file.
If you don't know, you can count them in a first pass:

import bz2
from tqdm import tqdm

total = 0
with bz2.open("hugefile.bz2", "rt") as bz2file:
    for line in bz2file:
        total += 1

with bz2.open("hugefile.bz2", "rt") as bz2file:
    for line in tqdm(bz2file, desc="hugefile", total=total):
        ...

But this means decompressing and reading each file twice, so you might not want to do it.
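If you do go the counting route, the first pass can be made cheaper by decompressing in large binary chunks and counting newlines, instead of paying the per-line overhead of the line iterator. A minimal sketch, assuming every record ends with a newline (the function name and chunk size are my own choices):

```python
import bz2

def count_lines(path, chunk_size=1 << 20):
    """Count newline-terminated lines by decompressing in 1 MiB binary chunks."""
    total = 0
    with bz2.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

The result can then be passed straight to tqdm as total= for the real pass.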

Bytes method

Another method is to figure out how many bytes each line you're reading is, using this: https://stackoverflow.com/a/30686735/8915326

And combine it with the total file size. One caveat: len(line.encode("utf-8")) measures decompressed bytes, while os.path.getsize returns the compressed size on disk, so you can't compare the two directly. You can, however, estimate the decompressed total from the ~0.04 average compression factor you mentioned:

import bz2
import os
from tqdm import tqdm

hugefile = "hugefile.bz2"
# The on-disk size is compressed; estimate the decompressed size
# from the ~0.04 average compression factor.
est_total = os.path.getsize(hugefile) / 0.04
with bz2.open(hugefile, "rt") as bz2file:
    with tqdm(desc=hugefile, total=est_total, unit="B", unit_scale=True) as pbar:
        for line in bz2file:
            ...
            pbar.update(len(line.encode("utf-8")))

This way you're not going over your file twice, but the progress bar is only as accurate as the compression-factor estimate.
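If you'd rather not rely on an estimate, you can track progress in compressed bytes instead: bz2.open accepts an already-open binary file object, and that underlying file's tell() advances as compressed data is consumed, so it can drive a bar whose total is the exact on-disk size. A sketch of that idea (the function name is mine, and it returns the line count only so there's something to return):

```python
import bz2
import os
from tqdm import tqdm

def process(path):
    """Iterate JSON lines while showing progress in exact compressed bytes."""
    n_lines = 0
    with open(path, "rb") as raw, \
         bz2.open(raw, "rt") as bz2file, \
         tqdm(desc=path, total=os.path.getsize(path),
              unit="B", unit_scale=True) as pbar:
        for line in bz2file:
            n_lines += 1  # ... handle the JSON record here ...
            # raw.tell() is the compressed offset consumed so far;
            # advance the bar to that absolute position.
            pbar.update(raw.tell() - pbar.n)
    return n_lines
```

Because bz2 reads the compressed stream in buffered chunks, the bar advances in jumps rather than per line, but it always reflects the true position in the file.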