python - Read ZIP files from S3 without downloading the entire file

Question

We have ZIP files that are 5-10GB in size. The typical ZIP file has 5-10 internal files, each 1-5 GB in size uncompressed.

I have a nice set of Python tools for reading these files. Basically, I can open a filename and if there is a ZIP file, the tools search in the ZIP file and then open the compressed file. It's all rather transparent.

I want to store these files in Amazon S3 as compressed files. I can fetch ranges of S3 files, so it should be possible to fetch the ZIP central directory (it's the end of the file, so I can just read the last 64KiB), find the component I want, download that, and stream directly to the calling process.

So my question is, how do I do that through the standard Python ZipFile API? It isn't documented how to replace the filesystem transport with an arbitrary object that supports POSIX semantics. Is this possible without rewriting the module?

Janaka Bandara Janaka Bandara · Accepted Answer · 2018-09-22T08:43:37

Here's an approach which does not need to fetch the entire file (full version available here).

It does require boto (or boto3), though (unless you can mimic the ranged GETs via AWS CLI; which I guess is quite possible as well).

import sys
import zlib
import zipfile
import io

import boto
from boto.s3.connection import OrdinaryCallingFormat


# range-fetches a S3 key
def fetch(key, start, len):
    end = start + len - 1
    return key.get_contents_as_string(headers={"Range": "bytes=%d-%d" % (start, end)})


# parses 2 or 4 little-endian bits into their corresponding integer value
def parse_int(bytes):
    val = ord(bytes[0]) + (ord(bytes[1]) << 8)
    if len(bytes) > 3:
        val += (ord(bytes[2]) << 16) + (ord(bytes[3]) << 24)
    return val


"""
bucket: name of the bucket
key:    path to zipfile inside bucket
entry:  pathname of zip entry to be retrieved (path/to/subdir/file.name)    
"""

# OrdinaryCallingFormat prevents certificate errors on bucket names with dots
# https://stackguides.com/questions/51604689/read-zip-files-from-amazon-s3-using-boto3-and-python#51605244
_bucket = boto.connect_s3(calling_format=OrdinaryCallingFormat()).get_bucket(bucket)
_key = _bucket.get_key(key)

# fetch the last 22 bytes (end-of-central-directory record; assuming the comment field is empty)
size = _key.size
eocd = fetch(_key, size - 22, 22)

# start offset and size of the central directory
cd_start = parse_int(eocd[16:20])
cd_size = parse_int(eocd[12:16])

# fetch central directory, append EOCD, and open as zipfile!
cd = fetch(_key, cd_start, cd_size)
zip = zipfile.ZipFile(io.BytesIO(cd + eocd))


for zi in zip.filelist:
    if zi.filename == entry:
        # local file header starting at file name length + file content
        # (so we can reliably skip file name and extra fields)

        # in our "mock" zipfile, `header_offset`s are negative (probably because the leading content is missing)
        # so we have to add to it the CD start offset (`cd_start`) to get the actual offset

        file_head = fetch(_key, cd_start + zi.header_offset + 26, 4)
        name_len = parse_int(file_head[0:2])
        extra_len = parse_int(file_head[2:4])

        content = fetch(_key, cd_start + zi.header_offset + 30 + name_len + extra_len, zi.compress_size)

        # now `content` has the file entry you were looking for!
        # you should probably decompress it in context before passing it to some other program

        if zi.compress_type == zipfile.ZIP_DEFLATED:
            print zlib.decompressobj(-15).decompress(content)
        else:
            print content
        break

In your case you might need to write the fetched content to a local file (due to large size), unless memory usage is not a concern.

python - Read ZIP files from S3 without downloading the entire file

3 Answers