
I'm creating an ETL job in AWS Glue that pulls, from an S3 location, the most recent edits (i.e. the current data) for each entity in the repository. The data in the repository is a historical account of all edits for each entity. Each day I run the ETL and it writes out to another S3 location, i.e. Bucket/path/to/files/current_date/..., where the current date is dynamic and consistent with the date of the ETL run.

The issue I am running into is that I cannot delete from S3 programmatically (organizational restriction), nor move the files, since a move is a copy plus a delete behind the scenes and therefore also fails; either of those would have left a single path for Glue to crawl. Instead, I would like to set up the crawler so that the date portion of the path is dynamic, but I have not been able to figure out a way to do that. Does anyone know if that is possible?

My data is partitioned by run_date (the current date mentioned above) as well as six other hierarchical partitions. I am creating the crawlers and the ETL jobs via CloudFormation (YAML). The paths for the crawlers are stored as SSM parameters defined in the CloudFormation templates.

Path SSM Param Example

S3CurrentPath:
    Type: AWS::SSM::Parameter
    Properties:
      Description: "Path in the S3 Lake where the current entity data is stored."
      Type: String
      Value: 'Data/Entities/Software/SoftwareCurrent'
      Name: "/org/member/local/s3/path/entityCurrent"

Crawler Resource Code:

GenericCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Role: !Ref RoleNAme
      Name: !Sub "${ProfileName}-crawler-${CrawlerName}"
      Configuration: !Sub |
        {
          "Version": 1.0,
          "CrawlerOutput": {
            "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
            "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
          }
        }
      Description: !Ref CrawlerDescription
      DatabaseName: !Ref DatabaseName
      Targets:
        S3Targets:
          - Path: !Sub "s3://${S3DataBucket}/${S3Path}"

ETL DataSink Write Code:

# Write the joined dynamic frame out to a data sink, partitioned by run_date
# and the other hierarchical keys.
datasink = glueContext.write_dynamic_frame.from_options(
    frame=final_dynamic_frame,
    connection_type="s3",
    connection_options={
        "path": "s3://{lakeBucketName}/{lakePath}/".format(
            lakeBucketName=args["lakeBucketName"],
            lakePath=args["lakeDestinationPath"]),
        "partitionKeys": ["run_date", "location", "year", "month", "day", "hour", "timestamp"]},
    format="parquet",
    transformation_ctx="datasink")
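The run_date used as a partition key above is just a column on the frame, which is what produces the dated "folders" described earlier. A rough sketch of how it could be stamped onto the data before the write (the joined_dynamic_frame name is an assumption for illustration, not my exact code):

from datetime import date

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import lit

# Sketch (assumed frame name): add the run date as a column so the partitioned
# write creates .../run_date=YYYY-MM-DD/... prefixes under the destination path.
run_date = date.today().isoformat()
joined_df = joined_dynamic_frame.toDF().withColumn("run_date", lit(run_date))
final_dynamic_frame = DynamicFrame.fromDF(joined_df, glueContext, "final_dynamic_frame")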

My hope is that the crawler would look at the most recent date in the repository, i.e. the most recent run_date partition "folder", and crawl only that without looking back over the older data.
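For illustration only, this is the kind of "latest run_date prefix" I mean; a sketch using boto3, where the bucket name and prefix are placeholders and the listing is not part of the current job:

import boto3

# Hypothetical illustration of "the most recent run_date partition folder".
# Bucket name and prefix below are placeholders.
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="my-lake-bucket",
    Prefix="Data/Entities/Software/SoftwareCurrent/",
    Delimiter="/")
run_date_prefixes = [cp["Prefix"] for cp in response.get("CommonPrefixes", [])]
latest_prefix = max(run_date_prefixes)  # e.g. ".../run_date=2019-07-02/"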

Please let me know if you would like to see more of the code--I will be happy to sanitize and provide.


1 Answer


To be honest, I haven't found a way to read/write data to dynamic paths using AWS Glue itself. What I generally do is read/write with plain PySpark methods:

# "datasink" here is a Spark DataFrame (e.g. a DynamicFrame converted via .toDF()).
datasink.write \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("s3://my-bucket/files/" + current_date)

You can even tell these methods which type of file to read or write (e.g. CSV via the format above, or a path glob like .../*.csv when reading). PySpark has more options and available methods than AWS Glue, so there is more flexibility. I also add a key/value record to a DynamoDB table to keep track of the latest run date.
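A rough sketch of that DynamoDB bookkeeping (table name and attribute names are made up for illustration, not a prescribed schema):

import boto3

# Sketch: record the latest run date so a later job or crawler run can look it up.
# Table name and attribute names are illustrative only.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("etl-run-tracking")
table.put_item(Item={
    "dataset": "SoftwareCurrent",     # partition key (assumed)
    "latest_run_date": current_date,  # e.g. "2019-07-02"
})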