
I am running an AWS Glue job to load a pipe-delimited file from S3 into an RDS Postgres instance, using the auto-generated PySpark script from Glue.

Initially, it complained about NULL values in some columns:

pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"

After some googling and reading on SO, I tried to replace the NULLs in my file by converting my AWS Glue DynamicFrame to a Spark DataFrame, running fillna(), and converting back to a DynamicFrame.

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "xyz_catalog", table_name = "xyz_staging_files", transformation_ctx = "datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = custom_df2.fromDF()

applymapping1 = ApplyMapping.apply(frame = custom_df3, mappings = [("id", "string", "id", "int"), ........more code

References:

https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md#3-there-are-some-transforms-that-i-cannot-figure-out

How to replace all Null values of a dataframe in Pyspark

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna

Now, when I run my job, it throws the following error:

Log Contents:
Traceback (most recent call last):
File "script_2017-12-20-22-02-13.py", line 23, in <module>
custom_df3 = custom_df2.fromDF()
AttributeError: 'DataFrame' object has no attribute 'fromDF'
End of LogType:stdout

I am new to Python and Spark and have tried a lot, but can't make sense of this. I'd appreciate some expert help.

I tried changing my reconvert command to this:

custom_df3 = glueContext.create_dynamic_frame.fromDF(frame = custom_df2)

But still got the error:

AttributeError: 'DynamicFrameReader' object has no attribute 'fromDF'

UPDATE: I suspect this is not about NULL values. The message "Can't get JDBC type for null" seems not to refer to a NULL value, but to some data/type that JDBC is unable to decipher.

I created a file with only one record and no NULL values, and changed all Boolean types to INT (replacing the values with 0 and 1), but I still get the same error:

pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"

UPDATE: Make sure DynamicFrame is imported (from awsglue.dynamicframe import DynamicFrame), since fromDF / toDF are part of DynamicFrame.

Refer to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
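
For reference, a minimal sketch of the corrected round-trip (the name "custom_df3" passed to fromDF is just an arbitrary label for the resulting frame):

from awsglue.dynamicframe import DynamicFrame

# DynamicFrame -> Spark DataFrame, fill NULLs, then back to a DynamicFrame
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)  # with an int argument, fillna only fills numeric columns

# fromDF is defined on the DynamicFrame class, not on DataFrame
custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, "custom_df3")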

Could it be because of the column data types? I remember that fillna() and dropna() by default only affect columns of certain data types, so if the column type is date, for example, it can still include null values after fillna() and dropna(). – ylcnky
Have you tried using DropNullFields? It removes null fields from a DynamicFrame: the output DynamicFrame does not contain fields of the null type in its schema. That way you can keep using DynamicFrames, which are fine-tuned for Glue jobs, and avoid converting to/from DataFrames. – leovrf

2 Answers


You are calling .fromDF on the wrong class. It should look like this:

from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(custom_df2, glueContext, 'label')
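
The three arguments are the Spark DataFrame to convert, the GlueContext, and a name for the resulting DynamicFrame ('label' above is an arbitrary placeholder). Applied to the script in the question, that would be, for example:

custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, "custom_df3")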

For the error pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null", you should drop the null fields.

I was getting similar errors while loading into Redshift tables. After using the command below, the issue was resolved:

Loading = DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "Loading")
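
In context, a sketch of where this sits in a Glue script, just before the JDBC write; note that the connection name "my-postgres-connection" and the connection_options values below are placeholders, not taken from the original script:

from awsglue.transforms import DropNullFields

# Drop null-typed fields so JDBC does not have to map a null type
Loading = DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "Loading")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = Loading,
    catalog_connection = "my-postgres-connection",  # placeholder Glue connection name
    connection_options = {"dbtable": "xyz_table", "database": "xyz_db"},  # placeholder values
    transformation_ctx = "datasink")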