I have a AWS Glue Job moving data from RDS table to Redshift.
both table has same schema:
-- RDS
CREATE TABLE my_table (
id varchar(256) not null primary key
col1 varchar(256) not null
)
-- Redshift
CREATE TABLE my_table (
id varchar(256) not null
col1 varchar(256) not null
) sortkey(id)
I crawled both schemas and wrote a trivial job to write DynamicFrame from RDS source to Redshift sink.
val datasource = glueContext.getCatalogSource(
database = "my_rds",
tableName = "my_table",
redshiftTmpDir = ""
).getDynamicFrame()
glueContext.getCatalogSink(
database = "my_redshift",
tableName = "my_table",
redshiftTmpDir = "s3://some-bucket/some-path"
).writeDynamicFrame(datasource)
but the job fails for rows with empty string values of col1 with:
java.sql.SQLException:
Error (code 1213) while loading data into Redshift: "Missing data for not-null field"
Table name: my_table
Column name: col1
Column type: varchar(254)
Raw line: 3027616797,@NULL@
Raw field value: @NULL@
When I debug this with a glue-spark-shell I can verify that the value is an empty string "".
scala> datasource.toDF().filter("id = '3027616797'").select("col1").collect().head.getString(0)
res23: String = ""
How can I tell glue to distinguish between empty strings "" and NULLs?