I have a table with a column 'date' and the date format is yyyyMMdd. I need to filter this dataframe and return a dataframe with only rows with dates greater than an input, For eg: Return all the rows where date is greater than "20180715". I did the following.
scala> df.groupBy("date").count.show(50,false)
+--------+----------+
|date |count |
+--------+----------+
|20180707|200 |
|20180715|1429586969|
|20180628|1425490080|
|20180716|1429819708|
+--------+----------+
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20180715")))
scala> con.count
res4: Long = 0
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20170715")))
scala> con.count
res1: Long = 4284896957
When I input the date as "20170715", it counts all the records, whereas if the date is "20180715", the filter condition does not work. What is the correct way to compare with a string date.
date
is of string type, you can simply dodf.filter($"date" > "20180715")
since a date string with format "yyyyMMdd" would be in the suitable order for string comparison. – Leo C