3
votes

I would like to read compressed files directly from Google Cloud Storage and open them with the Python csv package. The code for a local file would be:

import csv
import gzip

def reader(self):
    print "reading local compressed file: ", self._filename
    self._localfile = gzip.open(self._filename, 'rb')
    csvReader = csv.reader(self._localfile, delimiter=',', quotechar='"')
    return csvReader

I have played with several GCS APIs (JSON-based, cloud.storage), but none of them seems to give me something that I can stream through gzip. What is more, even if the file were uncompressed, I could not open the file and hand it to csv.reader (which expects an iterator).

My compressed CSV files are about 500 MB, while uncompressed they grow to a few GB. I don't think it would be a good idea to: 1 - download the files locally before opening them (unless I can overlap download and computation), or 2 - load them entirely into memory before processing.

Finally, I currently run this code on my local machine, but ultimately I will move to AppEngine, so it must work there too.
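(Side note for readers: csv.reader does not require an actual file object; any iterable that yields text lines works, so a streaming wrapper around the download is enough. A small self-contained illustration:)

```python
import csv

# csv.reader accepts any iterable of text lines, not only file objects.
lines = ["a,b,c", '1,"2,5",3']
rows = list(csv.reader(lines, delimiter=',', quotechar='"'))
print(rows)  # [['a', 'b', 'c'], ['1', '2,5', '3']]
```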

Thanks!!

2
What about splitting your file into multiple parts? - Raito
That's already multiple parts of a 1+TB dataset. :D Breaking it down further seems unnecessary. I'm trying Alex Martelli's suggestion. - user1066293

2 Answers

6
votes

Using GCS, cloudstorage.open(filename, 'r') will give you a read-only file-like object (earlier created similarly but with 'w':-), which you can read a chunk at a time and feed through the standard Python library's zlib module, specifically a zlib.decompressobj, if, of course, the GS object was originally created in the complementary way (with a zlib.compressobj).
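A minimal sketch of that compressobj/decompressobj pairing, using in-memory bytes here in place of actual GCS reads and writes:

```python
import zlib

data = b"hello,world\n" * 1000

# Writing side: compress with a compressobj, then flush the tail.
co = zlib.compressobj()
compressed = co.compress(data) + co.flush()

# Reading side: decompress a chunk at a time with a decompressobj,
# the way you would while reading from cloudstorage.open(..., 'r').
do = zlib.decompressobj()
out = []
for i in range(0, len(compressed), 256):
    out.append(do.decompress(compressed[i:i + 256]))
out.append(do.flush())

assert b"".join(out) == data
```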

Alternatively, for convenience, you can use the standard Python library's gzip module, e.g. for the reading phase something like:

compressed_flo = cloudstorage.open('objname', 'r')
uncompressed_flo = gzip.GzipFile(fileobj=compressed_flo, mode='rb')
csvReader = csv.reader(uncompressed_flo)

and vice versa for the earlier writing phase, of course.
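The writing phase might look like the following sketch (io.BytesIO stands in for the cloudstorage file object here, since this example does not touch GCS):

```python
import gzip
import io

# Stand-in for cloudstorage.open('objname', 'w') on App Engine.
compressed_flo = io.BytesIO()

# Wrap the file-like object so writes are gzip-compressed.
gz = gzip.GzipFile(fileobj=compressed_flo, mode='wb')
gz.write(b"col1,col2\n1,2\n3,4\n")
gz.close()

# Read it back, mirroring the reading snippet above.
compressed_flo.seek(0)
uncompressed_flo = gzip.GzipFile(fileobj=compressed_flo, mode='rb')
print(uncompressed_flo.read())
```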

Note that when you run locally (with the dev_appserver), the GCS client library uses local disk files to simulate GCS -- in my experience that's good for development purposes, and I can use gsutil or other tools when I need to interact with "real" GCS storage from my local workstation... the GCS client library is for when I need such interaction from my GAE app (and for developing said GAE app locally in the first place:-).

4
votes

So, you have gzipped files stored on GCS. You can process the data stored on GCS in a stream-like fashion. That is, you can download, unzip, and process simultaneously. This avoids

  • having the unzipped file on disk
  • having to wait until the download is complete before being able to process the data.

gzip files have a small header and footer, and the body is a DEFLATE-compressed stream that can be decompressed incrementally, chunk by chunk. Python's zlib module helps you with that!

Edit: This is example code for how to decompress and analyze a zlib or gzip stream chunk-wise, purely based on zlib:

import zlib
from collections import Counter


def stream(filename):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(1024)
            if not chunk:
                break
            yield chunk


def decompress(stream):
    # Generate decompression object. Auto-detect and ignore
    # gzip wrapper, if present.
    z = zlib.decompressobj(32 + 15)
    for chunk in stream:
        r = z.decompress(chunk)
        if r:
            yield r
    # Emit any data still buffered inside the decompressor.
    r = z.flush()
    if r:
        yield r


c = Counter()
s = stream("data.gz")
for chunk in decompress(s):
    for byte in chunk:
        c[byte] += 1


print c

I tested this code with an example file data.gz, created with GNU gzip.

Quotes from http://www.zlib.net/manual.html:

windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format (the zlib format will return a Z_DATA_ERROR). If a gzip stream is being decoded, strm->adler is a crc32 instead of an adler32.

and

Any information contained in the gzip header is not retained [...]
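That windowBits auto-detection is easy to verify with a small sketch (assuming only the standard zlib module):

```python
import zlib

data = b"hello" * 100
zlib_stream = zlib.compress(data)  # zlib-wrapped DEFLATE data

# windowBits = 32 + 15: auto-detect zlib or gzip headers.
assert zlib.decompressobj(32 + 15).decompress(zlib_stream) == data

# windowBits = 16 + 15: gzip only; a zlib stream raises zlib.error.
try:
    zlib.decompressobj(16 + 15).decompress(zlib_stream)
except zlib.error:
    print("zlib stream rejected in gzip-only mode")
```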