We are currently having an issue with an AWS Glue job that reads a collection from S3 and writes it to AWS Redshift, where some columns contain null values.
The job should be fairly simple and most of the code is auto-generated by the Glue interface, but because some columns that are NOT NULL in Redshift are sometimes null in our data set, we are unable to get the job to complete.
A condensed version of the code is shown below; it is Python running in a PySpark (AWS Glue) environment.
from awsglue.transforms import ApplyMapping, DropNullFields
from awsglue.dynamicframe import DynamicFrame
# (standard Glue job/context setup omitted for brevity)

# Read the source table from the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_1", table_name = "table_1", transformation_ctx = "datasource0")

# Resolve ambiguous column types by casting the price columns to double
resolvedDDF = datasource0.resolveChoice(specs = [
    ('price_current', 'cast:double'),
    ('price_discount', 'cast:double'),
])

# Map source columns to the target Redshift columns
applymapping = ApplyMapping.apply(frame = resolvedDDF, mappings = [
    ("id", "string", "id", "string"),
    ("status", "string", "status", "string"),
    ("price_current", "double", "price_current", "double"),
    ("price_discount", "double", "price_discount", "double"),
    ("created_at", "string", "created_at", "string"),
    ("updated_at", "string", "updated_at", "string"),
], transformation_ctx = "applymapping")

# Convert to a Spark DataFrame to drop rows missing the required fields,
# then convert back to a DynamicFrame
droppedDF = applymapping.toDF().dropna(subset=('created_at', 'price_current'))
newDynamicDF = DynamicFrame.fromDF(droppedDF, glueContext, "newframe")

# Drop null fields and write to Redshift via the catalog connection
dropnullfields = DropNullFields.apply(frame = newDynamicDF, transformation_ctx = "dropnullfields")
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields, catalog_connection = "RedshiftDataStaging", connection_options = {"dbtable": "dbtable_1", "database": "database_1"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink")
We have NOT NULL constraints on the price_current and created_at columns in Redshift, and due to some early errors in our system some records have reached the S3 bucket without the required data. We want to simply drop these rows, as they make up a very small percentage of the overall data to process.
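For reference, the row filtering we are after could also be expressed with Glue's Filter transform directly on the DynamicFrame instead of the toDF()/dropna() round trip; this is only a sketch, assuming created_at and price_current are present in the record schema (possibly with null values):

from awsglue.transforms import Filter

# Sketch: keep only records where both required fields are non-null
# (same intent as the dropna step above)
filtered = Filter.apply(
    frame = applymapping,
    f = lambda row: row["created_at"] is not None and row["price_current"] is not None,
    transformation_ctx = "filtered")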
Despite the dropna step above, we still get the following error from Redshift:
Error (code 1213) while loading data into Redshift: "Missing data for not-null field"
Table name: "PUBLIC".table_1
Column name: created_at
Column type: timestamp(0)
Raw field value: @NULL@
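To narrow this down, a check along the following lines (only a sketch, reusing the dropnullfields frame and column names from the job above) should show whether any null created_at or price_current values actually survive the drop steps before the write to Redshift:

from pyspark.sql import functions as F

# Sketch: count rows that would still violate the NOT NULL constraints
# right before the Redshift write
check_df = dropnullfields.toDF()
remaining = check_df.filter(F.col("created_at").isNull() | F.col("price_current").isNull()).count()
print(remaining)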