1 vote

I am trying to copy the latest file (based on Last Modified) from the Folder_Test1 folder to the Folder_Test2 folder in the same AWS S3 bucket, using --exclude and --include in the copy command.

Folder_Test1:

Name                               Last Modified
T1_abc_june21.csv                  June 21,2020 9:27:03 AM GMT-0700
T1_abc_june21.csv                  June 21,2020 7:40:15 PM GMT-0700
T1_abc_june21.csv                  June 21,2020 9:20:32 PM GMT-0700
T1_abc_june25.csv                  June 25,2020 10:23:30 PM GMT-0700
T2_abc_june29.csv                  June 29,2020 6:15:12 AM GMT-0700
T2_abc_june29.csv                  June 29,2020 5:12:15 PM GMT-0700 (Fetch this object)
T1_abc_def_june21.csv              June 21,2020 6:13:15 PM GMT-0700
T2_abc_def_june25.csv              June 25,2020 5:33:10 AM GMT-0700
T3_abc_def_june25.csv              June 25,2020 9:31:15 PM GMT-0700 (Fetch this object)

I have to filter on the file name so that only the latest abc file is picked up, excluding the abc_def files:

Here is what I tried. Step 1: copy the abc files from Folder_Test1 to Folder_Test2:

aws s3 cp s3://$bucket/Folder_Test1/ s3://$bucket/Folder_Test2/ --recursive --exclude "*abc_def*"

Step 2: fetch the latest abc file from Folder_Test2:

aws s3 ls s3://$bucket/Folder_Test2/ --recursive | sort | tail -n 1 | awk '{print $4}'

How can I copy the latest file from Folder_Test2 to Folder_Test3? Or, alternatively, how can I remove all files except the latest one from Folder_Test2?

You won't be able to get this directly out of the API; you will need local ordering/filtering logic. - jordanm

3 Answers

1 vote

I was able to get this to work, but it requires some shell-related code and jq. In a Linux environment I was able to do something like:

aws s3 cp s3://$bucket/`aws s3api list-objects-v2 --bucket $bucket --prefix Folder_Test1/ | jq -r '.Contents | sort_by(.LastModified)[-1].Key'` s3://$bucket/Folder_Test2/

What does this do? The first part finds the most recent file that starts with "Folder_Test1/" in this example:

aws s3api list-objects-v2 --bucket $bucket --prefix Folder_Test1/ | jq -r '.Contents | sort_by(.LastModified)[-1].Key'

Then we pipe this output to jq, letting it sort the objects by the LastModified field and print the Key of the newest one. Note that this uses s3api so that we get the raw JSON to parse.

Once we have that key, we use it as the source argument to the cp command.
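
If the nested backticks are hard to read, the same logic can be split into two commands; a minimal sketch of the same approach, capturing the key in a shell variable first:

# Find the newest object under the prefix, then copy just that one object
latest_key=$(aws s3api list-objects-v2 --bucket $bucket --prefix Folder_Test1/ | jq -r '.Contents | sort_by(.LastModified)[-1].Key')
aws s3 cp "s3://$bucket/$latest_key" "s3://$bucket/Folder_Test2/"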

This was tested with V2 of the AWS CLI (2.0.30) on an Ubuntu system. The jq command was already installed.

1 vote

This command will list the 'latest' object for a given prefix:

aws s3api list-objects --bucket MY-BUCKET --prefix foo/ --query 'sort_by(Contents, &LastModified)[-1].Key' --output text

You could combine it with a copy command:

key=$(aws s3api list-objects --bucket SOURCE-BUCKET --prefix foo/ --query 'sort_by(Contents, &LastModified)[-1].Key' --output text)
aws s3 cp s3://SOURCE-BUCKET/$key s3://DEST-BUCKET/

The --query parameter is very powerful. See: JMESPath Tutorial

However, the list-objects command cannot be combined with --include/--exclude.
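
That said, you can push part of the include/exclude filtering into the JMESPath query itself. A sketch, assuming the goal is to keep abc files while dropping abc_def files (adjust the contains() patterns as needed):

aws s3api list-objects --bucket SOURCE-BUCKET --prefix Folder_Test1/ \
  --query "sort_by(Contents[?contains(Key, '_abc_') && !contains(Key, 'abc_def')], &LastModified)[-1].Key" \
  --output text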

Frankly, it would probably be easier to write a small Python script to accomplish your goal.

1 vote

How many files are you scanning?

If it's 100,000 or more, you might want to use something faster than aws-cli. S3P uses a parallel listing algorithm to accelerate S3 bucket listing by more than 10x.

All you need to install is Node.js. Then run s3p with:

npx s3p map \
  --bucket my-bucket \
  --prefix Folder_Test1/ \
  --reduce "js:(a, b) => a.LastModified > b.LastModified ? a : b" \
  --finally "js:({Key}) => Key"

That will output the key of the most recently modified file in Folder_Test1/.
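
To copy that object rather than just print its key, you could capture the output in a shell variable and pass it to a normal aws s3 cp. A sketch, assuming the key is the only thing s3p writes to stdout:

# Hypothetical follow-up: capture the key s3p prints, then copy that single object
latest_key=$(npx s3p map \
  --bucket my-bucket \
  --prefix Folder_Test1/ \
  --reduce "js:(a, b) => a.LastModified > b.LastModified ? a : b" \
  --finally "js:({Key}) => Key")
aws s3 cp "s3://my-bucket/$latest_key" "s3://my-bucket/Folder_Test2/"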


Disclaimer: I wrote S3P for working with very large buckets.