0 votes

I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern to the filenames, I can rely only on the modified date to run multiple CLI commands in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any input would be highly appreciated!

Someone mentioned using s3api, but I am not sure how to combine s3api with the cp or sync command to download files.

Current command:

aws --endpoint-url http://example.com s3 cp s3://objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive 

I think this is wrong because --include here refers to 'Jun' appearing in the file name, not in the modified date.
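You're right: --include/--exclude match only the object key, never metadata such as the modified date. One way to select by date is to list keys via s3api filtered on LastModified, which is ISO 8601 and therefore sorts correctly as a string. A sketch (bucket, prefix, and the date window are placeholders; the command is built and printed so it can be reviewed before running):

```shell
# Placeholders: adjust bucket, prefix, and the date window to your environment.
BUCKET="my-bucket"
PREFIX="objects/EOB/"
START="2021-06-01"
END="2021-07-01"

# JMESPath string comparison works because LastModified is ISO 8601.
QUERY="Contents[?LastModified>='${START}' && LastModified<'${END}'].Key"

# Printed rather than executed so you can inspect it first.
printf 'aws s3api list-objects-v2 --bucket %s --prefix %s --query "%s" --output text\n' \
  "$BUCKET" "$PREFIX" "$QUERY"
```

The resulting key list can then be fed to aws s3 cp one object at a time, or split into chunks for parallel workers. Note that with millions of objects the listing itself is slow, which is why the inventory-based answer below may scale better.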


2 Answers

3 votes

The AWS CLI will copy files in parallel.

Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
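For reference, a minimal sync invocation might look like this (the bucket name and NAS mount point are placeholders; the command is printed for review rather than executed):

```shell
# Placeholder source bucket and destination mount; adjust to your environment.
CMD='aws s3 sync s3://my-bucket/objects/EOB/ /mnt/nas/EOB'
printf '%s\n' "$CMD"
```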

Worst case, if something goes wrong, just run the aws s3 sync command again.

It might take a while for the sync command to gather the list of objects, but just let it run.

If you find that there is a lot of network overhead due to so many small files, then you might consider:

  • Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
  • Do an aws s3 sync to copy the files to the instance
  • Zip the files (probably better in several groups rather than one large zip)
  • Download the zip files via scp, or copy them back to S3 and download from there

This way, you are minimizing the chatter and bandwidth going in/out of AWS.
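The "zip in several groups" step above can be sketched like this, with throwaway file names standing in for the synced objects and an arbitrary group size of 4:

```shell
# Demo data standing in for the files synced to the EC2 instance.
mkdir -p zip-demo && cd zip-demo
for i in 1 2 3 4 5 6 7 8 9 10; do echo "data $i" > "file$i.txt"; done

# Split the file list into groups of 4, then archive each group.
ls file*.txt | split -l 4 - group-
for g in group-*; do
  tar czf "archive-$g.tar.gz" -T "$g"   # -T reads file names from a list
done
ls archive-*.tar.gz
cd ..
```

Each archive can then be pulled down with scp (or copied back to S3) independently, so the downloads themselves can also run in parallel.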

2 votes

I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).

You may have to drive this from Amazon S3 Inventory. Use the inventory list, specifically the last-modified timestamps on the objects, to build the list of objects you need to process. Then partition that list somehow and ship the sub-lists off to a distributed/parallel process to fetch the objects.
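A sketch of that filter-and-partition step, assuming a CSV inventory with (bucket, key, size, last_modified) columns and a hypothetical June 2021 window; the sample rows below stand in for a real inventory report:

```shell
# Sample inventory rows (real reports come from the S3 Inventory feature;
# the column layout depends on the fields you configure).
cat > inventory.csv <<'EOF'
my-bucket,objects/EOB/a.pdf,1024,2021-05-30T10:00:00.000Z
my-bucket,objects/EOB/b.pdf,2048,2021-06-02T11:30:00.000Z
my-bucket,objects/EOB/c.pdf,4096,2021-06-15T09:00:00.000Z
my-bucket,objects/EOB/d.pdf,512,2021-07-01T00:00:00.000Z
EOF

# ISO 8601 timestamps compare correctly as strings, so awk can filter the window.
awk -F, '$4 >= "2021-06-01" && $4 < "2021-07-01" { print $2 }' inventory.csv > june-keys.txt

# Partition into sub-lists, one per worker (use a much larger -l in practice).
split -l 1 june-keys.txt worker-
wc -l < june-keys.txt
```

Each worker-* file can then drive its own download loop (for example, piping keys through xargs -P into aws s3 cp), which gives you the parallelism by modified date that the question is after.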