AWS Glue to Redshift: duplicate data?

Question

Here are some bullet points in terms of how I have things setup:

I have CSV files uploaded to S3 and a Glue crawler setup to create the table and schema. I have a Glue job setup that writes the data from the Glue table to our Amazon Redshift database using a JDBC connection. The Job also is in charge of mapping the columns and creating the redshift table. By re-running a job, I am getting duplicate rows in redshift (as expected).

However, is there way to replace or delete rows before inserting the new data?

BOOKMARK functionality is Enable but not working.

How can I connect to redshift, delete all data as a part of JOB before pushing data to redshift in Python?

Yuriy Bondaruk Yuriy Bondaruk · Accepted Answer · 2018-09-20T08:14:31

Currently Glue doesn't support bookmarking for JDBC sources.

You can implement upsert/merge into Redshift in Glue job using postactions option (code in Scala):

val fields = sourceDf.columns.mkString(",")

glueContext.getJDBCSink(
  catalogConnection = "RedshiftConnectionTest",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> "staging_schema.staging_table",
    "postactions" -> 
        s"""
           DELETE FROM dst_schema.dst_table USING staging_schema.staging_table AS S WHERE dst_table.id = S.id;
           INSERT INTO dst_schema.dst_table ($fields) SELECT $fields FROM staging_schema.staging_table;
           DROP TABLE IF EXISTS staging_schema.staging_table
        """
  )),
  redshiftTmpDir = tempDir,
  transformationContext = "redshift-output"
).writeDynamicFrame(DynamicFrame(sourceDf, glueContext))

If you just want to delete existing table then you can use preactions parameter instead:

glueContext.getJDBCSink(
  catalogConnection = "RedshiftConnectionTest",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> "dst_schema.dst_table",
    "preactions" -> "DELETE FROM dst_schema.dst_table"
  )),
  redshiftTmpDir = tempDir,
  transformationContext = "redshift-output"
).writeDynamicFrame(DynamicFrame(sourceDf, glueContext))

AWS Glue to Redshift: duplicate data?

5 Answers