
I need to perform an append load to an S3 bucket.

  1. Every day a new .gz file gets dumped to an S3 location, and a Glue crawler reads the data and updates the Data Catalog.
  2. A Scala AWS Glue job runs and filters the data for the current day only (see the sketch after this list).
  3. The filtered data is transformed according to some rules, and a partitioned dynamic frame (partitioned at the year/month/day level) is created.
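
For step 2, here is a minimal sketch of what the current-day filter could look like. The database name my_database, the table name my_table, and the date column event_date are all hypothetical placeholders for illustration; the Glue calls themselves (getCatalogSource, getDynamicFrame) are the standard Scala API:

    import java.time.LocalDate
    import com.amazonaws.services.glue.DynamicFrame
    import org.apache.spark.sql.functions.col

    // Today's date as "yyyy-MM-dd", e.g. "2018-11-28"
    val today = LocalDate.now().toString

    // Read the crawled table from the Data Catalog
    // ("my_database" and "my_table" are hypothetical names)
    val sourceDyf = glueContext
      .getCatalogSource(database = "my_database", tableName = "my_table")
      .getDynamicFrame()

    // Filter via the underlying DataFrame, then wrap it back into a DynamicFrame
    // ("event_date" is a hypothetical column holding the record date as a string)
    val todaysDf  = sourceDyf.toDF().filter(col("event_date") === today)
    val todaysDyf = DynamicFrame(todaysDf, glueContext)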

Now I need to write this dynamic frame to an S3 bucket that already contains all the previous days' partitions. In fact, I only need to write a single partition to the bucket. Currently I am using the piece of code below to write data to S3.

    // Write it out in Parquet for ERROR severity
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions(Map(
        "path" -> "s3://some s3 bucket location",
        "partitionKeys" -> Seq("partitonyear", "partitonmonth", "partitonday"))),
      format = "parquet"
    ).writeDynamicFrame(DynamicFrame(dynamicDataframeToWrite.toDF().coalesce(maxExecutors), glueContext))

I am not sure whether the above piece of code will perform an append load or not. Is there a way, through the AWS Glue libraries, to achieve this?


1 Answer


Your script will append new data files to the appropriate partition. So if you are processing only today's data, it will create a new partition under the path. For example, if today is 2018-11-28, it will create a new data object in the s3://some_s3_bucket_location/partitonyear=2018/partitonmonth=11/partitonday=28/ folder.

If you write data into an existing partition, Glue will append new files and will not remove existing objects. However, this may lead to duplicates if you run the job multiple times over the same data; one way around that is sketched below.
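
If re-runs over the same day's data are possible, one option is to bypass getSinkWithFormat for the write and use plain Spark with dynamic partition overwrite, so a re-run replaces only the partition(s) present in the incoming data instead of appending duplicates. This is a sketch, not Glue-specific, assuming Spark 2.3+ and that the job's SparkSession is reachable as glueContext.sparkSession:

    import org.apache.spark.sql.SaveMode

    // Overwrite only the partitions present in the incoming data (Spark 2.3+);
    // all other partitions under the path are left untouched.
    val spark = glueContext.sparkSession
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    dynamicDataframeToWrite.toDF()
      .coalesce(maxExecutors)
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("partitonyear", "partitonmonth", "partitonday")
      .parquet("s3://some s3 bucket location")

Note that SaveMode.Overwrite without the partitionOverwriteMode setting would wipe all partitions under the path, so that config line is essential here.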