0 votes

I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern to the filenames, I can rely only on the modified date to run multiple CLI commands in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any input would be highly appreciated!

Someone mentioned using s3api, but I am not sure how to combine s3api with the cp or sync command to download files.

Current command:

aws --endpoint-url http://example.com s3 cp s3://objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive 

I think this is wrong because --include here refers to 'Jun' appearing in the file name, not in the modified date.
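You're right: --include/--exclude match only the object key, never metadata such as the modified date. One way to select by date is to list keys via s3api filtered on LastModified, which is ISO 8601 and therefore sorts correctly as a string. A sketch (bucket, prefix, and the date window are placeholders; the command is built and printed so it can be reviewed before running):

```shell
# Placeholders: adjust bucket, prefix, and the date window to your environment.
BUCKET="my-bucket"
PREFIX="objects/EOB/"
START="2021-06-01"
END="2021-07-01"

# JMESPath string comparison works because LastModified is ISO 8601.
QUERY="Contents[?LastModified>='${START}' && LastModified<'${END}'].Key"

# Printed rather than executed so you can inspect it first.
printf 'aws s3api list-objects-v2 --bucket %s --prefix %s --query "%s" --output text\n' \
  "$BUCKET" "$PREFIX" "$QUERY"
```

The resulting key list can then be fed to aws s3 cp one object at a time, or split into chunks for parallel workers. Note that with millions of objects the listing itself is slow, which is why the inventory-based answer below may scale better.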


2 Answers

3 votes

The AWS CLI will copy files in parallel.

Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
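For reference, a minimal sync invocation might look like this (the bucket name and NAS mount point are placeholders; the command is printed for review rather than executed):

```shell
# Placeholder source bucket and destination mount; adjust to your environment.
CMD='aws s3 sync s3://my-bucket/objects/EOB/ /mnt/nas/EOB'
printf '%s\n' "$CMD"
```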

Worst case, if something goes wrong, just run the aws s3 sync command again.

It might take a while for the sync command to gather the list of objects, but just let it run.

If you find that there is a lot of network overhead due to so many small files, then you might consider:

  • Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
  • Do an aws s3 sync to copy the files to the instance
  • Zip the files (probably better in several groups rather than one large zip)
  • Download the zip files via scp, or copy them back to S3 and download from there

This way, you are minimizing the chatter and bandwidth going in/out of AWS.
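The "zip in several groups" step above can be sketched like this, with throwaway file names standing in for the synced objects and an arbitrary group size of 4:

```shell
# Demo data standing in for the files synced to the EC2 instance.
mkdir -p zip-demo && cd zip-demo
for i in 1 2 3 4 5 6 7 8 9 10; do echo "data $i" > "file$i.txt"; done

# Split the file list into groups of 4, then archive each group.
ls file*.txt | split -l 4 - group-
for g in group-*; do
  tar czf "archive-$g.tar.gz" -T "$g"   # -T reads file names from a list
done
ls archive-*.tar.gz
cd ..
```

Each archive can then be pulled down with scp (or copied back to S3) independently, so the downloads themselves can also run in parallel.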

2 votes

I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).

You may have to drive this from Amazon S3 Inventory. Use the inventory list, specifically the last-modified timestamps on the objects, to build the list of objects you need to process. Then partition that list somehow and ship the sub-lists off to a distributed/parallel process to fetch the objects.
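A sketch of that filter-and-partition step, assuming a CSV inventory with (bucket, key, size, last_modified) columns and a hypothetical June 2021 window; the sample rows below stand in for a real inventory report:

```shell
# Sample inventory rows (real reports come from the S3 Inventory feature;
# the column layout depends on the fields you configure).
cat > inventory.csv <<'EOF'
my-bucket,objects/EOB/a.pdf,1024,2021-05-30T10:00:00.000Z
my-bucket,objects/EOB/b.pdf,2048,2021-06-02T11:30:00.000Z
my-bucket,objects/EOB/c.pdf,4096,2021-06-15T09:00:00.000Z
my-bucket,objects/EOB/d.pdf,512,2021-07-01T00:00:00.000Z
EOF

# ISO 8601 timestamps compare correctly as strings, so awk can filter the window.
awk -F, '$4 >= "2021-06-01" && $4 < "2021-07-01" { print $2 }' inventory.csv > june-keys.txt

# Partition into sub-lists, one per worker (use a much larger -l in practice).
split -l 1 june-keys.txt worker-
wc -l < june-keys.txt
```

Each worker-* file can then drive its own download loop (for example, piping keys through xargs -P into aws s3 cp), which gives you the parallelism by modified date that the question is after.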