I am trying to filter a DataFrame by comparing two date columns, using Scala and Spark. Calculations then run on top of the filtered DataFrame to derive new columns. Simplified, my DataFrame has the following schema:
|-- received_day: date (nullable = true)
|-- finished: int (nullable = true)
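For reproducibility, sample data could look like this (the values are made up; this assumes a SparkSession in scope named spark):

import java.sql.Date
import spark.implicits._

val df = Seq(
  (Date.valueOf("2018-12-01"), 0),
  (Date.valueOf("2018-12-12"), 1),
  (Date.valueOf("2018-12-18"), 1)
).toDF("received_day", "finished")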
On top of that I create two new columns, t_start and t_end, that will be used for filtering the DataFrame: t_end lies 10 days and t_start lies 20 days before the original column received_day:
import org.apache.spark.sql.functions.{col, date_sub, lit}

val dfWithDates = df
  .withColumn("t_end", date_sub(col("received_day"), 10))
  .withColumn("t_start", date_sub(col("received_day"), 20))
I now want a new calculated column that indicates, for each row, how many rows of the DataFrame fall into that row's t_start to t_end period. I thought I could achieve this the following way:
val dfWithCount = dfWithDates
  .withColumn("cnt", lit(
    dfWithDates.filter(
      $"received_day".lt(col("t_end"))
        && $"received_day".gt(col("t_start"))).count()))
However, this count only returns 0, and I believe the problem lies in the arguments I am passing to lt and gt.
Following the issue here, Filtering a spark dataframe based on date, I realized that I need to pass a string value. If I try hardcoded values like lt(lit("2018-12-15")), the filtering works. So I tried casting my columns to StringType:
import org.apache.spark.sql.types.DataTypes

val dfWithDates = df
  .withColumn("t_end", date_sub(col("received_day"), 10).cast(DataTypes.StringType))
  .withColumn("t_start", date_sub(col("received_day"), 20).cast(DataTypes.StringType))
But the filter still returns an empty DataFrame. I assume that I am not handling the data types correctly.
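For what it's worth, comparing two date columns directly in a standalone filter does seem to work without any casting (a quick sanity check; the column name shifted is mine):

// lt/gt accept Column arguments: shifted is always 10 days before received_day,
// so this filter keeps every row and the count equals df.count().
df.withColumn("shifted", date_sub($"received_day", 10))
  .filter($"shifted".lt($"received_day"))
  .count()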
I am running on Scala 2.11.0 with Spark 2.0.2.
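To make the goal concrete: with the sample data above, I would expect cnt to be 0, 1 and 1. A naive sketch that produces that per-row count via a non-equi self-join (just to illustrate the intended result, not necessarily the approach I want; it assumes received_day values are distinct):

import org.apache.spark.sql.functions.count

// For each row of alias "a", count the rows of alias "b" whose received_day
// falls strictly inside a's (t_start, t_end) window; left_outer plus counting
// a column of "b" yields 0 when there is no match.
val expected = dfWithDates.as("a")
  .join(dfWithDates.as("b"),
    $"b.received_day".gt($"a.t_start") && $"b.received_day".lt($"a.t_end"),
    "left_outer")
  .groupBy($"a.received_day", $"a.finished", $"a.t_start", $"a.t_end")
  .agg(count($"b.received_day").as("cnt"))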
Comments:

Where can I find documentation for the data types that should be passed to lt() and gt()? I tried searching for it but couldn't find what I was looking for. – Inna

Column data types here: spark.apache.org/docs/latest/api/scala/… – philantrovert