1
votes

I have numerous files compressed in the bz2 format, and I am trying to decompress them into a temporary directory using Python so that I can analyze them. There are hundreds of thousands of files, so decompressing them manually isn't feasible, and I wrote the following script.

My issue is that whenever I do this, every output file is at most 900 KB, even though manually decompressing the same files gives about 6 MB each. I am not sure whether this is a flaw in my code (for example, in how I read the data and copy it to the file) or a problem with something else. I have tried this with different files, and I know it works for files smaller than 900 KB. Has anyone else had a similar problem and found a solution?

My code is below:

import numpy as np
import bz2
import os
import glob

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed

    '''


    cpath = os.getcwd() #get current path
    filenames_ = []  #list to add filenames to for future use

    for zipped_file in glob.glob(filepath):  #loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file,'rb') as zipfile:   #Read in the bz2 files
            newfilepath = cpath +'/temp/'+zipped_file[-47:-4]     #create a temporary file
            with open(newfilepath, "wb") as tmpfile: #open the temporary file
                for i,line in enumerate(zipfile.readlines()):
                    tmpfile.write(line) #write the data from the compressed file to the temporary file



            filenames_.append(newfilepath)
    return filenames_


path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)   

It returns the correct file paths, but the file sizes are wrong: each one is capped at 900 KB.

2
Side-note: Get rid of the enumerate and .readlines(); you don't care about the line number, and .readlines() forces you to hold the whole decompressed file in memory, not just a line at a time (the file-like object already iterates by line, so forcing it to eagerly slurp to a list is just wasting memory). - ShadowRanger
Don't read binary data line by line. Use shutil.copyfileobj() to handle decompression efficiently. Or use the shutil archive decompression functions. - Martijn Pieters♦
Also, with bz2.open(zipped_file) as zipfile, open(newfilepath, 'wb') as tmpfile: shutil.copyfileobj(zipfile, tmpfile) would be faster and simpler for that matter. (Looks like Martijn beat me to that suggestion). - ShadowRanger
Thanks for the help. Those suggestions help the efficiency but the files are still capping out at 900 KB for some reason. - BenT
What Python version are you using? The bz2 implementation in 2.7 and 3.3 does not properly handle files created by certain compressors such as pbzip2 - see bugs.python.org/issue20781 for details. - jasonharper
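
The shutil.copyfileobj() approach suggested in the comments above could look roughly like the sketch below. The helper name decompress_to is hypothetical, and bz2.open assumes Python 3.3 or later. As BenT notes, this improves memory use but does not by itself remove the 900 KB cap on an interpreter whose bz2 module cannot read multi-stream files.

```python
import bz2
import shutil


def decompress_to(src_path, dst_path):
    """Stream-decompress one .bz2 file to dst_path (hypothetical helper)."""
    # bz2.open defaults to binary read mode; the data is never split on
    # newlines, which matters because the payload is binary.
    with bz2.open(src_path) as src, open(dst_path, "wb") as dst:
        # copyfileobj copies in fixed-size chunks, so the whole
        # decompressed file is never held in memory at once.
        shutil.copyfileobj(src, dst)
    return dst_path
```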

2 Answers

1
votes

It turns out this issue is due to the files being multi-stream, which Python 2.7's bz2 module does not support. There is more info here, as mentioned by jasonharper, and here. Below is a workaround that calls the Unix bzip2 command to decompress the files and then moves them to the temporary directory I want. It is not as pretty, but it works.

import glob
import os
import shutil
import subprocess

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the paths of all the temporary files that have been uncompressed

    '''

    cpath = os.getcwd() #get current path
    newfilepath = os.path.join(cpath, 'temp')  #temporary directory (must already exist)
    filenames_ = []  #list to add filenames to for future use

    for zipped_file in glob.glob(filepath):  #loop over the files that meet the name criteria
        unzipped_file = zipped_file[:-4]  #bzip2 writes its output next to the input, minus '.bz2'
        newfilename = os.path.join(newfilepath, os.path.basename(unzipped_file))

        #-k keeps the original .bz2; check_call waits for bzip2 to finish and raises on failure
        subprocess.check_call(['bzip2', '-kd', zipped_file])
        shutil.move(unzipped_file, newfilename)

        filenames_.append(newfilename)
    return filenames_


path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'

unzip_f(path_)
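
To illustrate what "multi-stream" means here, the following sketch (run on Python 3.3 or later) concatenates two independently compressed streams and compares a single-stream decompressor with bz2.decompress(). The payloads are made up for the demonstration. A plausible reading of the 900 KB cap is that a parallel compressor such as pbzip2 wrote one stream per 900k block, so a reader that stops after the first stream sees only the first block; that is an assumption about how these particular files were produced.

```python
import bz2

# Two independently compressed streams concatenated into one blob,
# similar in spirit to what a parallel compressor may write.
first = b"A" * 1000
second = b"B" * 1000
blob = bz2.compress(first) + bz2.compress(second)

# A single BZ2Decompressor stops at the end of the first stream.
single = bz2.BZ2Decompressor()
head = single.decompress(blob)
leftover = single.unused_data  # bytes that belong to the second stream

# bz2.decompress() on Python 3.3+ continues through every stream.
everything = bz2.decompress(blob)

print(len(head), len(leftover) > 0, len(everything))
```

A reader that handles only one stream returns `head` and silently ignores `leftover`, which matches the truncated output described in the question.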
0
votes

This is a known limitation in Python 2, where the BZ2File class doesn't support multiple streams. It can be resolved easily by using bz2file, https://pypi.org/project/bz2file/, which is a backport of the Python 3 implementation and can be used as a drop-in replacement.

After running pip install bz2file, you can just replace bz2 with it (import bz2file as bz2) and everything should just work :)

The original Python bug report: https://bugs.python.org/issue1625
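
A minimal sketch of the drop-in import described above, assuming bz2file is installed on Python 2 and falling back to the standard library elsewhere. The helper name read_all is hypothetical; only the BZ2File class is used, since that is the class this answer names.

```python
try:
    import bz2file as bz2  # third-party backport (pip install bz2file)
except ImportError:
    import bz2  # standard library; reads multi-stream files on Python 3.3+


def read_all(path):
    """Return the decompressed contents of a .bz2 file (hypothetical helper)."""
    # With a multi-stream-aware BZ2File, read() continues past the end of
    # the first stream instead of stopping there.
    with bz2.BZ2File(path, "rb") as f:
        return f.read()
```

The rest of the original script can stay unchanged, because the module is still referred to as bz2.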