The requirement is that we receive historical daily files in the source bucket. The files have the following format -

Source bucket -

s3://sourcebucket/abc/111111_abc_1180301000014_1-3_1180301042833.txt
s3://sourcebucket/abc/111111_cde_1180302000042_2-3_1180302042723.txt

These are sample values as I can't post the exact file name -

111111_abc_1180301000014_1-3_1180301042833.txt

where 1180301000014 carries the date and time: 180301 is the date (March 1st, 2018, yymmdd) and 000014 is the time in hours, minutes and seconds (hhmmss).

Once we receive all the hourly files for March 1st, we need to copy those files to another bucket and then do further processing. Currently, the copy part is working fine: it copies all the files present in the source bucket to the destination. However, I am not sure how to apply a filter so that it picks only the March 1st files first and copies them to another bucket. It should then pick up the remaining files in sequential order.

Python script -

import boto3
import json
s3 = boto3.resource('s3')


def lambda_handler(event, context):
    bucket = s3.Bucket('<source-bucket>')
    dest_bucket = s3.Bucket('<destination-bucket>')

    for obj in bucket.objects.filter(Prefix='abc/',Delimiter='/'):
        dest_key = obj.key
        print(dest_key)
        s3.Object(dest_bucket.name, dest_key).copy_from(CopySource = {'Bucket': obj.bucket_name, 'Key': obj.key})

I am not that well versed in Python. In fact, this is my first Python script. Any guidance is appreciated.

Your code would need to grab "today's date", subtract a day, then find filenames matching that day. However, you need to be careful of timezones - what is your definition of "Once we receive all the files for March 1st"? What timezone is used by the files? If you run code on an EC2 instance, it will use UTC as the timezone unless you specifically code otherwise. - John Rotenstein
These are historical files. My code will be run through Lambda. - Shash

1 Answer

You can extract the date string portion of the filename (ideally by splitting the string on '_') and pass it into a handling function such as:

from datetime import datetime as dt

def parse_date(date_string):
    # Filename timestamps look like yymmddHHMMSS, e.g. "180301000014"
    form = "%y%m%d%H%M%S"
    date = dt.strptime(date_string, form)

    # dt.now() is local time; use dt.utcnow() if the timestamps are in UTC
    diff = dt.now() - date

    # True only if the timestamp is less than one full day old
    return diff.days < 1

# False
print(parse_date("180301000014"))
# True if run less than a day after this timestamp (it was True when this answer was posted)
print(parse_date("180606000014"))

You can look at https://docs.python.org/3/library/datetime.html for more info on handling dates in Python. You will need to account for time zones as well.
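
To get the date string out of an S3 key in the first place, something like the following could work. It is only a sketch: `extract_timestamp` is a hypothetical helper name, and it assumes the third underscore-separated field is 13 digits long, with one leading digit followed by yymmddHHMMSS, as in the sample names in the question.

```python
from datetime import datetime

def extract_timestamp(key):
    """Return the datetime embedded in a key such as
    'abc/111111_abc_1180301000014_1-3_1180301042833.txt'.
    """
    # Drop the 'abc/' prefix, then take the third '_'-separated field
    filename = key.rsplit("/", 1)[-1]
    field = filename.split("_")[2]  # e.g. '1180301000014'
    # Assumption: the first character is a prefix digit, the rest is yymmddHHMMSS
    return datetime.strptime(field[1:], "%y%m%d%H%M%S")

print(extract_timestamp("abc/111111_abc_1180301000014_1-3_1180301042833.txt"))
```

If the real file names differ from the samples, the field index and the slice would need adjusting.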

For matching by day to a target date:

def by_target_date(date_string, target_date):
    form = "%y%m%d%H%M%S"
    date = dt.strptime(date_string, form)

    # target_date is expected to be a datetime; comparing calendar dates
    # checks that year, month and day all match
    return date.date() == target_date.date()
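
For the original question (copy one day's files first, then the rest in sequence), the keys could also be grouped by their embedded date and handled in sorted order. This is a rough sketch rather than a tested Lambda: `group_keys_by_day` is a hypothetical helper, and the filename layout (one prefix digit, then yymmddHHMMSS in the third field) is assumed from the samples in the question.

```python
from collections import defaultdict
from datetime import datetime

def group_keys_by_day(keys):
    """Map each calendar date to the keys stamped with it, oldest date first."""
    groups = defaultdict(list)
    for key in keys:
        field = key.rsplit("/", 1)[-1].split("_")[2]
        # Assumption: one prefix digit, then yymmddHHMMSS
        stamp = datetime.strptime(field[1:], "%y%m%d%H%M%S")
        groups[stamp.date()].append((stamp, key))
    # Sort the days, and the keys within each day, chronologically
    return {
        day: [key for _, key in sorted(items)]
        for day, items in sorted(groups.items())
    }

keys = [
    "abc/111111_cde_1180302000042_2-3_1180302042723.txt",
    "abc/111111_abc_1180301000014_1-3_1180301042833.txt",
]
for day, day_keys in group_keys_by_day(keys).items():
    # Copy day_keys to the destination bucket here, one day at a time
    print(day, day_keys)
```

In the Lambda from the question, `keys` could be built from `obj.key for obj in bucket.objects.filter(Prefix='abc/')`, with the existing `copy_from` call applied to each key inside the loop.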