2 votes

I am using AWS Glue to read AWS Aurora DB (MySQL) tables and write them into S3 files. The MySQL table has date columns containing the value '0000-00-00', which is allowed in MySQL. Because of this, my Glue (PySpark) job is failing. How can I handle this in the Glue code?

Things I tried that failed:

  1. Appended zeroDateTimeBehavior to the JDBC connection URL: jdbc:mysql://<host-name>/<db-name>?zeroDateTimeBehavior=convertToNull&autoReconnect=true&characterEncoding=UTF-8&characterSetResults=UTF-8

  2. Dropped the date columns from the DynamicFrame/DataFrame in the PySpark code, e.g. df.drop(df["date_column"]) (see the sketch after this list).

  3. Deleted the date column from the Glue table definition. It looks like all the columns of the table are read regardless.
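For reference, attempt 2 as a minimal runnable sketch, assuming df is a DataFrame already read from the catalog and date_column is a placeholder name. Note that drop() returns a new DataFrame rather than modifying df in place:

    # drop() is not in-place: reassign the result. Even so, count() may
    # still fail, since the failing timestamp conversion happens inside
    # the MySQL driver while rows are fetched, before the drop applies.
    df = df.drop("date_column")
    total_record_count = df.count()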

Error message below:

Traceback (most recent call last):
File "script_2018-08-03-21-41-06.py", line 107, in <module>
total_record_count=datasourceDF0.count()
File "/mnt/yarn/usercache/root/appcache/application_1533330570684_0005/container_1533330570684_0005_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 427, in count
File "/mnt/yarn/usercache/root/appcache/application_1533330570684_0005/container_1533330570684_0005_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1533330570684_0005/container_1533330570684_0005_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1533330570684_0005/container_1533330570684_0005_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o335.count.

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 21, ip-172-24-120-182.us-west-2.compute.internal, executor 1): java.sql.SQLException: Value '0000-00-00' can not be represented as java.sql.Timestamp
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:996)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
at com.mysql.jdbc.ResultSetRow.getNativeTimestamp(ResultSetRow.java:606)
at com.mysql.jdbc.ByteArrayRow.getNativeTimestamp(ByteArrayRow.java:187)
at com.mysql.jdbc.ResultSetImpl.getNativeTimestamp(ResultSetImpl.java:4309)
at com.mysql.jdbc.ResultSetImpl.getTimestampInternal(ResultSetImpl.java:5929)
at com.mysql.jdbc.ResultSetImpl.getTimestamp(ResultSetImpl.java:5609)
I already tried this in the Glue connection (#1) and it failed. Thank you, the article was helpful. – Kiran
Have you tried dropping the date_column in the ApplyMapping transform function? – cozyss
Yes, I removed date_column in the ApplyMapping transformation and it is not working. I am passing only the required columns in ApplyMapping(), and those required columns include the date column that holds the '0000-00-00' value. Even when I remove those columns, I still get the error. It looks like the glueContext.create_dynamic_frame.from_catalog() method reads all the columns in the MySQL table and ignores the schema defined in the Glue Data Catalog. – Kiran
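One way to sidestep the catalog read described above is glueContext.create_dynamic_frame.from_options(), where the JDBC URL, and therefore the zeroDateTimeBehavior parameter from attempt 1, can be supplied directly. A minimal sketch with placeholder host, database, table, and credential values, untested against Aurora:

    # zeroDateTimeBehavior=convertToNull asks the MySQL driver to return
    # NULL for '0000-00-00' instead of throwing SQLException.
    datasource = glueContext.create_dynamic_frame.from_options(
        connection_type="mysql",
        connection_options={
            "url": "jdbc:mysql://<host-name>:3306/<db-name>?zeroDateTimeBehavior=convertToNull",
            "dbtable": "<table-name>",
            "user": "<user>",
            "password": "<password>",
        },
    )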

1 Answer

0 votes

Verify the type the field is given in the crawler table schema and set it to string. That way you won't get the parse error anymore. Then select the column with df.selectExpr() and format the data however you want. Some Spark SQL expressions:
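A minimal sketch, assuming the crawler now types the field as string and that df and date_column are placeholder names:

    # CASE WHEN maps MySQL's zero date to NULL before parsing; to_date()
    # then converts the surviving 'yyyy-MM-dd' strings to real dates.
    df_clean = df.selectExpr(
        "CASE WHEN date_column = '0000-00-00' THEN NULL "
        "ELSE to_date(date_column) END AS date_column"
    )

    # nullif() is an equivalent, shorter form:
    # df_clean = df.selectExpr(
    #     "to_date(nullif(date_column, '0000-00-00')) AS date_column")

From there, the cleaned DataFrame can be converted back with DynamicFrame.fromDF() before writing to S3.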