0
votes

I have a table with a column 'date' and the date format is yyyyMMdd. I need to filter this dataframe and return a dataframe with only rows with dates greater than an input, For eg: Return all the rows where date is greater than "20180715". I did the following.

scala> df.groupBy("date").count.show(50,false)  
+--------+----------+                                                              
|date    |count     |  
+--------+----------+  
|20180707|200       |  
|20180715|1429586969| 
|20180628|1425490080| 
|20180716|1429819708|  
+--------+----------+ 

scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20180715")))

scala> con.count
res4: Long = 0

scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20170715")))

scala> con.count
res1: Long = 4284896957 

When I input the date as "20170715", it counts all the records, whereas if the date is "20180715", the filter condition does not work. What is the correct way to compare with a string date.

1
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("2018-07-15"))) After changing the format of input, it worked scala> con.count res6: Long = 1429819708Albin
If column date is of string type, you can simply do df.filter($"date" > "20180715") since a date string with format "yyyyMMdd" would be in the suitable order for string comparison.Leo C

1 Answers

1
votes

Changing the format of the input string passed to the lit function, solved this issue.

scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("2018-07-15"))) 

scala> con.count 
res6: Long = 1429819708