
I have an AWS Glue job moving data from an RDS table to Redshift.

Both tables have the same schema:

-- RDS
CREATE TABLE my_table (
   id varchar(256) not null primary key,
   col1 varchar(256) not null
)

-- Redshift
CREATE TABLE my_table (
   id varchar(256) not null,
   col1 varchar(256) not null
) sortkey(id)

I crawled both schemas and wrote a trivial job that writes a DynamicFrame from the RDS source to the Redshift sink.

// Read the source table from the Data Catalog (RDS source, so no temp dir is needed)
val datasource = glueContext.getCatalogSource(
      database = "my_rds",
      tableName = "my_table",
      redshiftTmpDir = ""
    ).getDynamicFrame()

// Write to the Redshift table via the catalog sink (COPY stages through the S3 temp dir)
glueContext.getCatalogSink(
      database = "my_redshift",
      tableName = "my_table",
      redshiftTmpDir = "s3://some-bucket/some-path"
    ).writeDynamicFrame(datasource)

But the job fails for rows where col1 is an empty string:

java.sql.SQLException:
Error (code 1213) while loading data into Redshift: "Missing data for not-null field"
Table name: my_table
Column name: col1
Column type: varchar(254)
Raw line: 3027616797,@NULL@
Raw field value: @NULL@

When I debug this with the glue-spark-shell, I can verify that the value is an empty string "":

scala> datasource.toDF().filter("id = '3027616797'").select("col1").collect().head.getString(0)
res23: String = ""                                                              
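For completeness, the source DataFrame really does carry the two cases distinctly; a quick check in the same shell (a sketch, reusing the datasource value from the job above):

scala> val df = datasource.toDF()
scala> df.filter(df("col1").isNull).count()   // genuine NULLs
scala> df.filter(df("col1") === "").count()   // empty strings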

How can I tell Glue to distinguish between empty strings ("") and NULLs?


1 Answer


It looks like this is a problem in the Databricks data source for Redshift (docs), which AWS Glue apparently uses internally. There are open tickets about exactly this problem, but they have not been touched for over a year.

I have tried writing through that data source directly, but the result is exactly the same:

import org.apache.spark.sql.SaveMode

datasource
  .toDF()
  .write
  .format("com.databricks.spark.redshift")
  .option("url", "<RS_JDBC_URL>?user=<USER>&password=<PASSWORD>")
  .option("dbtable", "my_table")
  .option("tempdir", "s3://S_PATH")
  .option("forward_spark_s3_credentials", "true")
  .mode(SaveMode.Append) // append to the existing table; the default mode would fail because my_table already exists
  .save()
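The only mitigation I have found so far is lossy: rewrite empty strings to a sentinel value before the write, so the data source no longer serializes them as its @NULL@ marker. A minimal sketch, assuming a single space is an acceptable stand-in (that sentinel choice is my assumption, not something from the tickets):

import org.apache.spark.sql.functions.{col, lit, when}

// Lossy workaround sketch: swap "" for a single-space sentinel so the
// data source stops emitting it as the @NULL@ null marker. The sentinel
// is an arbitrary assumption; pick anything that cannot collide with
// real values, or trim it back on the Redshift side after loading.
val patched = datasource
  .toDF()
  .withColumn("col1", when(col("col1") === "", lit(" ")).otherwise(col("col1")))

patched can then be written with the same options as above.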