Downloading a large archive from AWS Glacier using Boto

Question

I am trying to download a large archive (~ 1 TB) from Glacier using the Python package, Boto. The current method that I am using looks like this:

import os
import boto.glacier
import boto
import time

ACCESS_KEY_ID = 'XXXXX'
SECRET_ACCESS_KEY = 'XXXXX'
VAULT_NAME = 'XXXXX'
ARCHIVE_ID = 'XXXXX'
OUTPUT = 'XXXXX'

layer2 = boto.connect_glacier(aws_access_key_id = ACCESS_KEY_ID,
                              aws_secret_access_key = SECRET_ACCESS_KEY)

gv = layer2.get_vault(VAULT_NAME)

job = gv.retrieve_archive(ARCHIVE_ID)
job_id = job.id

while not job.completed:
    time.sleep(10)
    job = gv.get_job(job_id)

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT)

The problem is that the job ID expires after 24 hours, which is not enough time to retrieve the entire archive. I will need to break the download into at least 4 pieces. How can I do this and write the output to a single file?

your download takes longer than 24hrs? I mean, you are limited by bandwidth? On EC2, extract/resend it to S3 so you can download later, or pull it to an EC2 box and download from there. — tedder42

volent · Accepted Answer · 2015-01-16T16:02:12

It seems that you can simply specify the chunk_size parameter when calling job.download_to_file, like so:

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT, chunk_size=1024*1024)

However, if you can't download all the chunks within the 24 hours, I don't think layer2 lets you download only the ones you missed.
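To make "only the ones you missed" concrete, here is a minimal sketch of how the inclusive byte ranges for the not-yet-downloaded part of an archive could be computed. It is plain Python with no boto calls, and `remaining_byte_ranges` is a hypothetical helper name, not something boto provides.

```python
def remaining_byte_ranges(total_size, chunk_size, already_downloaded=0):
    """Yield inclusive (start, end) byte ranges for the bytes still missing."""
    start = already_downloaded
    while start < total_size:
        # The end of a range is inclusive, so a full chunk ends at
        # start + chunk_size - 1; the last chunk is capped at the final byte.
        end = min(start + chunk_size, total_size) - 1
        yield (start, end)
        start = end + 1
```

For example, `list(remaining_byte_ranges(10, 4))` gives `[(0, 3), (4, 7), (8, 9)]`, and passing `already_downloaded=4` skips the first range.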

First method

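Using layer1 (reachable from the layer2 connection as layer2.layer1), you can call get_job_output and specify the byte range you want to download. Note that the range is inclusive at both ends.

It would look like this:

CHUNK_SIZE = 1024 * 1024

# Resume from however many bytes are already on disk.
file_size = os.path.getsize(OUTPUT) if os.path.exists(OUTPUT) else 0

if job.completed:
    print "Downloading archive"
    # Append ('ab') so that a rerun keeps the bytes already downloaded.
    with open(OUTPUT, 'ab') as output_file:
        start = file_size
        while start < job.archive_size:
            # Byte ranges are inclusive, hence the "- 1".
            end = min(start + CHUNK_SIZE, job.archive_size) - 1
            response = layer2.layer1.get_job_output(VAULT_NAME, job_id, (start, end))
            output_file.write(response.read())
            start = end + 1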

With this script, you should be able to rerun it after a failure and continue downloading your archive where you left off.
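The resume idea can also be sketched independently of boto, which makes it easy to try out locally. In the sketch below, `resume_download` and `fetch_range` are hypothetical names: `fetch_range(start, end)` stands in for whatever call returns the bytes of an inclusive range (for instance a thin wrapper around the Glacier job-output request), so this is an illustration under that assumption rather than boto's own API.

```python
import os


def resume_download(path, total_size, fetch_range, chunk_size=1024 * 1024):
    """Append the missing bytes of a download to `path`; return bytes written.

    `fetch_range(start, end)` is a hypothetical callable returning the bytes
    of the inclusive range [start, end].
    """
    # Whatever is already on disk counts as downloaded.
    start = os.path.getsize(path) if os.path.exists(path) else 0
    written = 0
    # 'ab' appends, so bytes from an earlier run are preserved.
    with open(path, 'ab') as output_file:
        while start < total_size:
            end = min(start + chunk_size, total_size) - 1
            data = fetch_range(start, end)
            if len(data) != end - start + 1:
                raise IOError("short read for range %d-%d" % (start, end))
            output_file.write(data)
            # Flush after each chunk so progress reaches the file promptly.
            output_file.flush()
            start = end + 1
            written += len(data)
    return written
```

Because the starting offset is taken from the file size on every run, calling the function again after an interruption only requests the bytes that are still missing.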

Second method

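By digging in the boto code, I found a "private" method in the Job class that you might also use: _download_byte_range. With this method you can still use layer2. Since it is private, its signature may change between boto versions; in the boto 2 source I looked at, it takes a byte-range tuple plus a tuple of exceptions to retry, and returns the data together with the expected tree hash.

import socket

CHUNK_SIZE = 1024 * 1024

# Resume from however many bytes are already on disk.
file_size = os.path.getsize(OUTPUT) if os.path.exists(OUTPUT) else 0

if job.completed:
    print "Downloading archive"
    # Append ('ab') so that a rerun keeps the bytes already downloaded.
    with open(OUTPUT, 'ab') as output_file:
        start = file_size
        while start < job.archive_size:
            # Byte ranges are inclusive, hence the "- 1".
            end = min(start + CHUNK_SIZE, job.archive_size) - 1
            # Private API: (byte_range, retry_exceptions) -> (data, tree_hash)
            data, _ = job._download_byte_range((start, end), (socket.error,))
            output_file.write(data)
            start = end + 1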
