Survey
Is it possible to perform a batch upload to Amazon S3?
Yes*.
Does the S3 API support uploading multiple objects in a single HTTP call?
No.
Explanation
The Amazon S3 API doesn't support bulk upload, but the AWS CLI supports concurrent (parallel) uploads. From the client's perspective and in terms of bandwidth efficiency, these options should perform roughly the same.
────────────────────── time ────────────────────►
1. Serial
------------------
POST /resource
────────────────►
payload_1         POST /resource
                  ────────────────►
                  payload_2         POST /resource
                                    ────────────────►
                                    payload_3
2. Bulk
------------------
POST /bulk
┌────────────┐
│resources:  │
│- payload_1 │
│- payload_2 ├──►
│- payload_3 │
└────────────┘
3. Concurrent
------------------
POST /resource
────────────────►
payload_1
POST /resource
────────────────►
payload_2
POST /resource
────────────────►
payload_3
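The concurrent pattern above is what the AWS CLI implements internally, but it can also be approximated in plain shell with xargs -P. The sketch below uses echo as a stand-in for the actual upload command, so it can be run anywhere; in practice the echo would be replaced by something like aws s3 cp:

```shell
# Issue up to 4 "requests" at a time, one per input line.
# Stand-in for: xargs -P 4 -I {} aws s3 cp {} s3://bucket/prefix/
printf '%s\n' payload_1 payload_2 payload_3 |
  xargs -P 4 -I {} echo "POST /resource {}"
```

Because the invocations run in parallel, the order of the output lines is not guaranteed.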
AWS Command Line Interface
The documentation page "How can I improve the transfer performance of the sync command for Amazon S3?" suggests increasing concurrency in two ways. One of them is this:
To potentially improve performance, you can modify the value of max_concurrent_requests. This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10, and you can increase it to a higher value. However, note the following:
- Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
- Too many concurrent requests can overwhelm a system, which might cause connection timeouts or slow the responsiveness of the system. To avoid timeout issues from the AWS CLI, you can try setting the --cli-read-timeout value or the --cli-connect-timeout value to 0.
A script setting max_concurrent_requests and uploading a directory can look like this:

# Raise parallelism from the default of 10 to 64
aws configure set s3.max_concurrent_requests 64
# Upload the directory recursively using the configured concurrency
aws s3 cp local_path_from s3://remote_path_to --recursive
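If the higher concurrency triggers timeouts, the suggestion from the quoted documentation can be applied to the same command. A sketch (the paths are the same placeholders as in the script above):

```shell
# Disable AWS CLI client-side timeouts (0 means wait indefinitely),
# as suggested in the documentation quoted above
aws s3 cp local_path_from s3://remote_path_to --recursive \
  --cli-read-timeout 0 --cli-connect-timeout 0
```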
To give a clue about how running more threads consumes more resources, I did a small measurement in a container running aws-cli (using procpath) by uploading a directory with ~550 HTML files (~40 MiB in total, average file size ~72 KiB) to S3. The following chart shows the CPU usage, RSS and thread count of the uploading aws process.

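For reproducibility, the measurement can be collected roughly like this. This is a sketch based on my reading of procpath's documentation; the JSONPath query, flag names and file names are assumptions and may differ across procpath versions:

```shell
# Sample matching processes every second into an SQLite file
# (run this while the upload is in progress)
procpath record -i 1 -d s3-upload.sqlite '$..children[?("aws" in @.cmdline)]'

# Render an SVG chart of CPU usage and RSS from the recorded samples
procpath plot -d s3-upload.sqlite -q cpu -q rss -f s3-upload.svg
```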