3
votes

I would like to read compressed files directly from Google Cloud Storage and open them with the Python csv package. The code for a local file would be:

import csv
import gzip

def reader(self):
    print "reading local compressed file: ", self._filename
    self._localfile = gzip.open(self._filename, 'rb')
    csvReader = csv.reader(self._localfile, delimiter=',', quotechar='"')
    return csvReader

I have played with several GCS APIs (JSON-based, cloud.storage), but none of them seems to give me something that I can stream through gzip. What is more, even if the file were uncompressed, I could not open the file and hand it to csv.reader (which expects an iterator).

My compressed CSV files are about 500 MB, while uncompressed they grow to a few GB. I don't think it would be a good idea to: 1 - download the files locally before opening them (unless I can overlap download and computation), or 2 - load them entirely into memory before processing.

Finally, I currently run this code on my local machine, but ultimately I will move to AppEngine, so it must work there too.
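(Side note for readers: csv.reader does not require an actual file object; any iterable that yields text lines works, so a streaming wrapper around the download is enough. A small self-contained illustration:)

```python
import csv

# csv.reader accepts any iterable of text lines, not only file objects.
lines = ["a,b,c", '1,"2,5",3']
rows = list(csv.reader(lines, delimiter=',', quotechar='"'))
print(rows)  # [['a', 'b', 'c'], ['1', '2,5', '3']]
```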

Thanks!!

2
What about splitting your file into multiple parts? - Raito
That's already multiple parts of a 1+TB dataset. :D Breaking it down further seems unnecessary. I'm trying Alex Martelli's suggestion. - user1066293

2 Answers

6
votes

Using GCS, cloudstorage.open(filename, 'r') will give you a read-only file-like object (earlier created similarly but with 'w':-), which you can read a chunk at a time and feed through the standard Python library's zlib module, specifically a zlib.decompressobj, if, of course, the GS object was originally created in the complementary way (with a zlib.compressobj).
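A minimal sketch of that compressobj/decompressobj pairing, using in-memory bytes here in place of actual GCS reads and writes:

```python
import zlib

data = b"hello,world\n" * 1000

# Writing side: compress with a compressobj, then flush the tail.
co = zlib.compressobj()
compressed = co.compress(data) + co.flush()

# Reading side: decompress a chunk at a time with a decompressobj,
# the way you would while reading from cloudstorage.open(..., 'r').
do = zlib.decompressobj()
out = []
for i in range(0, len(compressed), 256):
    out.append(do.decompress(compressed[i:i + 256]))
out.append(do.flush())

assert b"".join(out) == data
```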

Alternatively, for convenience, you can use the standard Python library's gzip module, e.g. for the reading phase something like:

compressed_flo = cloudstorage.open('objname', 'r')
uncompressed_flo = gzip.GzipFile(fileobj=compressed_flo, mode='rb')
csvReader = csv.reader(uncompressed_flo)

and vice versa for the earlier writing phase, of course.
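The writing phase might look like the following sketch (io.BytesIO stands in for the cloudstorage file object here, since this example does not touch GCS):

```python
import gzip
import io

# Stand-in for cloudstorage.open('objname', 'w') on App Engine.
compressed_flo = io.BytesIO()

# Wrap the file-like object so writes are gzip-compressed.
gz = gzip.GzipFile(fileobj=compressed_flo, mode='wb')
gz.write(b"col1,col2\n1,2\n3,4\n")
gz.close()

# Read it back, mirroring the reading snippet above.
compressed_flo.seek(0)
uncompressed_flo = gzip.GzipFile(fileobj=compressed_flo, mode='rb')
print(uncompressed_flo.read())
```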

Note that when you run locally (with the dev_appserver), the GCS client library uses local disk files to simulate GCS -- in my experience that's good for development purposes, and I can use gsutil or other tools when I need to interact with "real" GCS storage from my local workstation... the GCS client library is for when I need such interaction from my GAE app (and for developing said GAE app locally in the first place:-).

4
votes

So, you have gzipped files stored on GCS. You can process the data stored on GCS in a stream-like fashion. That is, you can download, unzip, and process simultaneously. This avoids

  • having the unzipped file on disk
  • having to wait until the download is complete before being able to process the data.

gzip files have a small header and footer, and the body is a DEFLATE-compressed stream that can be decompressed incrementally, chunk by chunk. Python's zlib module helps you with that!

Edit: This is example code for how to decompress and analyze a zlib or gzip stream chunk-wise, purely based on zlib:

import zlib
from collections import Counter


def stream(filename):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(1024)
            if not chunk:
                break
            yield chunk


def decompress(stream):
    # Generate decompression object. Auto-detect and ignore
    # gzip wrapper, if present.
    z = zlib.decompressobj(32 + 15)
    for chunk in stream:
        r = z.decompress(chunk)
        if r:
            yield r
    # Emit any data still buffered inside the decompressor.
    r = z.flush()
    if r:
        yield r


c = Counter()
s = stream("data.gz")
for chunk in decompress(s):
    for byte in chunk:
        c[byte] += 1


print c

I tested this code with an example file data.gz, created with GNU gzip.

Quotes from http://www.zlib.net/manual.html:

windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format (the zlib format will return a Z_DATA_ERROR). If a gzip stream is being decoded, strm->adler is a crc32 instead of an adler32.

and

Any information contained in the gzip header is not retained [...]
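That windowBits auto-detection is easy to verify with a small sketch (assuming only the standard zlib module):

```python
import zlib

data = b"hello" * 100
zlib_stream = zlib.compress(data)  # zlib-wrapped DEFLATE data

# windowBits = 32 + 15: auto-detect zlib or gzip headers.
assert zlib.decompressobj(32 + 15).decompress(zlib_stream) == data

# windowBits = 16 + 15: gzip only; a zlib stream raises zlib.error.
try:
    zlib.decompressobj(16 + 15).decompress(zlib_stream)
except zlib.error:
    print("zlib stream rejected in gzip-only mode")
```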