I have a dataframe with a Date column along with a few other columns.
I want to validate the Date column and check whether its values are in the format "dd/MM/yyyy"; if the Date column holds any other format, the record should be marked as bad. So I am using option("dateFormat", "dd/MM/yyyy") to accept dates in that format, and dates that really are "dd/MM/yyyy" are parsed properly. But if I pass an invalid format (yyyy/MM/dd), the record is still not marked as invalid, and the date is converted to a garbage value instead (it looks as if a lenient date parser rolls the out-of-range day/month fields over rather than rejecting them).
Input file:
colData1,2020/05/07,colData2,colData3
colData4,2020/05/07,colData5,colData6
colData7,2020/05/07,colData8,colData9
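For completeness, customSchema looks roughly like the sketch below (field names taken from the output further down; rowId is added after the read, so it is not part of the read schema). As far as I understand, columnNameOfCorruptRecord only captures malformed rows when the named column exists in the schema as a StringType field and the reader runs in the default PERMISSIVE mode.

from pyspark.sql.types import StructType, StructField, StringType, DateType

# Sketch of customSchema; badRecords must be declared as StringType
# so that columnNameOfCorruptRecord can populate it
customSchema = StructType([
    StructField("OMIC", StringType(), True),
    StructField("SMIC", DateType(), True),
    StructField("VCLP", StringType(), True),
    StructField("VName", StringType(), True),
    StructField("badRecords", StringType(), True),
])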
# .csv() already selects the CSV source, so the legacy
# "com.databricks.spark.csv" format string is not needed
df = (spark.read
      .schema(customSchema)
      .option("escape", '"')
      .option("quote", '"')
      .option("header", "false")
      .option("dateFormat", "dd/MM/yyyy")
      .option("columnNameOfCorruptRecord", "badRecords")
      .csv(rdd))  # rdd holds the CSV lines
df.show()
DataFrame output:
+--------+----------+--------+--------+----------+-----+
| OMIC| SMIC| VCLP| VName|badRecords|rowId|
+--------+----------+--------+--------+----------+-----+
|colData1|0012-11-09|colData2|colData3| null| 0|
|colData4|0012-11-09|colData5|colData6| null| 1|
|colData7|0012-11-09|colData8|colData9| null| 2|
+--------+----------+--------+--------+----------+-----+
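As a fallback, I could read SMIC as a plain string and validate the format myself, something like this (untested sketch, assuming SMIC is declared as StringType in the schema instead of DateType):

from pyspark.sql import functions as F

# Fallback sketch: the regex rejects anything that is not exactly
# dd/MM/yyyy-shaped, so yyyy/MM/dd values like 2020/05/07 get flagged
# instead of being rolled over to a garbage date.
pattern = r"^\d{2}/\d{2}/\d{4}$"
flagged = df.withColumn("badDate", ~F.col("SMIC").rlike(pattern))
flagged.filter(F.col("badDate")).show()  # rows with a malformed SMIC
# (rows where SMIC is null would need a separate isNull() check)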
But is there a way to make the CSV reader itself mark such records as bad instead of silently converting the date? Please suggest.