I have a dataframe with a Date column along with a few other columns.
I want to validate the Date column and check whether its values are in the format "dd/MM/yyyy"; if the Date column holds any other format, the record should be marked as bad. So I am using option("dateFormat", "dd/MM/yyyy") to accept dates in that format, and dates that really are "dd/MM/yyyy" are parsed properly. But if I pass an invalid format (yyyy/MM/dd), the record is still not marked as invalid, and the date is converted to a garbage value instead (it looks as if a lenient date parser rolls the out-of-range day/month fields over rather than rejecting them).
Input file:
colData1,2020/05/07,colData2,colData3
colData4,2020/05/07,colData5,colData6
colData7,2020/05/07,colData8,colData9
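For completeness, customSchema looks roughly like the sketch below (field names taken from the output further down; rowId is added after the read, so it is not part of the read schema). As far as I understand, columnNameOfCorruptRecord only captures malformed rows when the named column exists in the schema as a StringType field and the reader runs in the default PERMISSIVE mode.

from pyspark.sql.types import StructType, StructField, StringType, DateType

# Sketch of customSchema; badRecords must be declared as StringType
# so that columnNameOfCorruptRecord can populate it
customSchema = StructType([
    StructField("OMIC", StringType(), True),
    StructField("SMIC", DateType(), True),
    StructField("VCLP", StringType(), True),
    StructField("VName", StringType(), True),
    StructField("badRecords", StringType(), True),
])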
# .csv() already selects the CSV source, so the legacy
# "com.databricks.spark.csv" format string is not needed
df = (spark.read
      .schema(customSchema)
      .option("escape", '"')
      .option("quote", '"')
      .option("header", "false")
      .option("dateFormat", "dd/MM/yyyy")
      .option("columnNameOfCorruptRecord", "badRecords")
      .csv(rdd))  # rdd holds the CSV lines
df.show()
DataFrame output:
+--------+----------+--------+--------+----------+-----+
| OMIC| SMIC| VCLP| VName|badRecords|rowId|
+--------+----------+--------+--------+----------+-----+
|colData1|0012-11-09|colData2|colData3| null| 0|
|colData4|0012-11-09|colData5|colData6| null| 1|
|colData7|0012-11-09|colData8|colData9| null| 2|
+--------+----------+--------+--------+----------+-----+
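As a fallback, I could read SMIC as a plain string and validate the format myself, something like this (untested sketch, assuming SMIC is declared as StringType in the schema instead of DateType):

from pyspark.sql import functions as F

# Fallback sketch: the regex rejects anything that is not exactly
# dd/MM/yyyy-shaped, so yyyy/MM/dd values like 2020/05/07 get flagged
# instead of being rolled over to a garbage date.
pattern = r"^\d{2}/\d{2}/\d{4}$"
flagged = df.withColumn("badDate", ~F.col("SMIC").rlike(pattern))
flagged.filter(F.col("badDate")).show()  # rows with a malformed SMIC
# (rows where SMIC is null would need a separate isNull() check)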
But is there a way to make the CSV reader itself mark such records as bad instead of silently converting the date? Please suggest.