I want to load data from Amazon S3 into an AWS Glue DynamicFrame, but limit it to a date range. My data is stored as Parquet files in S3, partitioned like this:
s3://bucket/all-dates/year=2021/month=4/day=13/
s3://bucket/all-dates/year=2021/month=4/day=14/
s3://bucket/all-dates/year=2021/month=4/day=15/
s3://bucket/all-dates/year=2021/month=4/day=16/

Currently I load the data into my script as:

ds1 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://bucket/all-dates/"],
        "recurse": True
    },
    format="parquet"
)

This works, but it loads all of the data into the DynamicFrame. What I would like to do is recurse only through the latest week, or latest two weeks, of files, counted back from the date the script runs.

Any help is appreciated. Thanks.

1 Answer

You can build a list of dates, construct one S3 path per date, and pass that list as "paths" in the connection options:

import pandas as pd

start_date = '2020-01-01'
end_date = '2020-01-10'
# pd.date_range includes both endpoints, one Timestamp per day
paths = [f's3://bucket/all-dates/year={d.year}/month={d.month}/day={d.day}/' for d in pd.date_range(start_date, end_date)]
# ['s3://bucket/all-dates/year=2020/month=1/day=1/',
#  's3://bucket/all-dates/year=2020/month=1/day=2/',
#  's3://bucket/all-dates/year=2020/month=1/day=3/',
#  's3://bucket/all-dates/year=2020/month=1/day=4/',
#  's3://bucket/all-dates/year=2020/month=1/day=5/',
#  's3://bucket/all-dates/year=2020/month=1/day=6/',
#  's3://bucket/all-dates/year=2020/month=1/day=7/',
#  's3://bucket/all-dates/year=2020/month=1/day=8/',
#  's3://bucket/all-dates/year=2020/month=1/day=9/',
#  's3://bucket/all-dates/year=2020/month=1/day=10/']

ds1 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": paths,
        "recurse": True  # probably unnecessary since we gave the exact paths
    },
    format="parquet"
)
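
Since the question asks for the latest week or two relative to the run date, the same idea can be sketched with only the standard library. The helper name `recent_partition_paths` and its parameters are my own, not part of Glue, and it assumes the unpadded `year=/month=/day=` layout shown in the question:

```python
from datetime import date, timedelta

def recent_partition_paths(base, days, today=None):
    """Return one S3 prefix per day for the `days` days ending at `today`.

    Hypothetical helper: `base` must end with '/', and the partition layout
    (unpadded month and day) is assumed to match the question.
    """
    today = today or date.today()
    # Walk from the oldest day to the newest so the list is in date order.
    day_list = (today - timedelta(days=offset) for offset in range(days - 1, -1, -1))
    return [f"{base}year={d.year}/month={d.month}/day={d.day}/" for d in day_list]

# Last 14 days, including the day the script runs.
paths = recent_partition_paths("s3://bucket/all-dates/", days=14)
```

The resulting `paths` list can be passed to `create_dynamic_frame_from_options` exactly as above. I have not checked how Glue reacts when one of the listed prefixes does not exist, so if some days can be missing it may be worth filtering the list first.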