I'm trying to copy data from Athena tables to an Aurora database on a regular basis. I was able to figure out how to set up the JDBC connection and crawler, and to create a job that successfully copies all the data from Athena to the Aurora database. However, running the same job multiple times causes duplicates. I partially solved this with Job Bookmarks, but if the Athena table data changes for even a few records, the underlying S3 file timestamps change, so Glue treats the files as new data and reinserts them into the Aurora table.
I found an AWS article, "How do I execute SQL commands on an Amazon Redshift table before or after writing data in an AWS Glue job?", that shows how to run pre- and post-actions as part of a Glue job. I'm trying to do the same thing: write the data into a stage table, then upsert it into the target table afterwards. Here is the final portion of the Glue script that is supposed to do the upsert:
pre_query = "DROP TABLE IF EXISTS stage; CREATE TABLE stage AS SELECT * FROM target WHERE 1=2;"
post_query = "BEGIN; DELETE FROM target WHERE EXISTS (SELECT 1 FROM stage WHERE stage.id = target.id); INSERT INTO target SELECT * FROM stage; DROP TABLE stage; COMMIT;"
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dropnullfields3,
    catalog_connection="connection",
    connection_options={
        "preactions": pre_query,
        "postactions": post_query,
        "dbtable": "stage",
        "database": "prod_db",
    },
    transformation_ctx="datasink4",
)
job.commit()
I can see that the stage table is being created and the data is copied fine, but the queries in preactions and postactions never execute. Can anyone tell me what is going on and how to achieve this? One way I can think of is to create a Lambda function that executes these pre- and post-queries, triggered before and after job execution (a rough sketch is below). Are there ways to get it done inside Glue itself?
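To make the Lambda idea concrete, here is a minimal sketch. It assumes Aurora MySQL reached through the pymysql client with credentials in environment variables; the event shape (a "phase" field set to "pre" or "post"), the variable names, and the idea of invoking it around the Glue job step (e.g. from Step Functions) are all placeholders of mine, not anything from the docs:

import os
import pymysql

# Same statements as the preactions/postactions strings in the Glue script.
PRE_QUERIES = [
    "DROP TABLE IF EXISTS stage",
    "CREATE TABLE stage AS SELECT * FROM target WHERE 1=2",
]
POST_QUERIES = [
    "DELETE FROM target WHERE EXISTS (SELECT 1 FROM stage WHERE stage.id = target.id)",
    "INSERT INTO target SELECT * FROM stage",
    "DROP TABLE stage",
]

def lambda_handler(event, context):
    # Invoked with {"phase": "pre"} before the Glue job and
    # {"phase": "post"} after it, e.g. by a Step Functions workflow.
    queries = PRE_QUERIES if event.get("phase") == "pre" else POST_QUERIES
    conn = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
        autocommit=False,
    )
    try:
        with conn.cursor() as cur:
            for q in queries:
                cur.execute(q)
        # Commit pending DML; note that MySQL runs DDL (DROP/CREATE)
        # with an implicit commit, so those statements can't roll back.
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
    return {"phase": event.get("phase"), "statements": len(queries)}

The downside is that the upsert logic then lives outside the Glue job and I have to orchestrate three steps instead of one, which is why I'd prefer to do it inside Glue if possible.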