
I wanted to use the date_trunc function on a dataframe that has a date column, so that I can create a new column telling me which quarter each record belongs to.

What I have tried is below:

import org.apache.spark.sql.functions._
val test = Seq(("2010-03-05"),("2018-01-16"),("2018-04-20")).toDF("TestDates")
display(test)  // this displays the dates in the notebook

val datetrunctest = test.withColumn("Quarter", date_trunc("QUARTER",$"TestDates"))
display(datetrunctest)  // this fails with: error: not found: value date_trunc

Also, when I try to import the function by name, I get the error below:

import org.apache.spark.sql.functions.date_trunc
Error: value date_trunc is not a member of object org.apache.spark.sql.functions

I am able to use the same function in Spark SQL, as below, by saving the above dataframe test as a table "DailyDates":

val ddd = spark.sql("Select TestDates,date_trunc('QUARTER', TestDates) as QuarterDate from test.DailyDates")
display(ddd)

I have a lot of transformations/aggregations that need to be performed on the dataframe, so I am looking for a way to make this work on the dataframe by adding an additional column. According to the documentation, this should work on any Spark version greater than 2.3.0, and I am using Spark 2.4.3.
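
Based on the linked documentation, the usage I would expect to work looks like the snippet below (the to_date wrapper is my own addition, since date_trunc returns a timestamp rather than a date):

import org.apache.spark.sql.functions._
// date_trunc("QUARTER", ...) should snap each date to the first day of its quarter
val expected = test.withColumn("QuarterDate", to_date(date_trunc("QUARTER", $"TestDates")))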

Snapshot of the Spark version output: (screenshot omitted)

Documentation link: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#date_trunc-java.lang.String-org.apache.spark.sql.Column-

Does anyone have any idea what the issue could be and how I can get this working?

Looks totally fine, just make sure you use Spark version > 2.3.0. – koiralo
@koiralo I have verified that the version I am using is 2.4.3, but it is still giving me that error. – Nikunj Kakadiya

2 Answers

0 votes

Check the version of Spark you are using:

Open the Spark shell and enter the command

sc.version or spark-submit --version
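
For example, from a notebook cell or the Spark shell (the printed values are only illustrative):

// Check the running Spark version programmatically
println(sc.version)     // e.g. 2.4.3
println(spark.version)  // same value through the SparkSession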

Make sure you import the function

import org.apache.spark.sql.functions._

Then use quarter, which takes a Column parameter:

public static Column quarter(Column e)
import org.apache.spark.sql.functions._
val test = Seq(("2010-03-05"),("2018-01-16"),("2018-04-20")).toDF("TestDates")
// quarter() returns the quarter (1 to 4) for the date each string is cast to
val datetrunctest = test.withColumn("Quarter", quarter($"TestDates"))
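
If the implicit string-to-date cast applies as expected, the Quarter column should hold the quarter number for each row (the output below is illustrative):

datetrunctest.show()
// +----------+-------+
// | TestDates|Quarter|
// +----------+-------+
// |2010-03-05|      1|
// |2018-01-16|      1|
// |2018-04-20|      2|
// +----------+-------+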
0 votes

The same code using date_trunc started working without any code change. I don't know whether something was wrong with the cluster on the backend, but it did not work for half a day and then started working in the evening. Weird. If this happens again, I will raise a support ticket with Databricks with the details.

I was also able to figure out a workaround that does not use the date_trunc function; it is below in case it helps someone with their use case.

import org.apache.spark.sql.functions._
val test = Seq(("2010-03-05"),("2018-01-16"),("2018-04-20"),("2018-08-15"),("2018-11-30")).toDF("TestDates")

// Map each date to the first day of its quarter using year() and quarter(),
// then drop the helper columns.
val datetrunctest = test.withColumn("Year", year($"TestDates"))
  .withColumn("Quarter", quarter($"TestDates"))
  .withColumn("QuarterDate", when($"Quarter" === 1, concat($"Year", lit("-01-01")))
    .when($"Quarter" === 2, concat($"Year", lit("-04-01")))
    .when($"Quarter" === 3, concat($"Year", lit("-07-01")))
    .when($"Quarter" === 4, concat($"Year", lit("-10-01"))))
  .drop("Year", "Quarter")

display(datetrunctest)
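
Since the SQL form of date_trunc did work for me, another possible workaround (a sketch only, assuming expr resolves the function the same way the SQL parser did) is to route the call through expr on the dataframe:

import org.apache.spark.sql.functions._
// Let the SQL parser resolve date_trunc; to_date turns the returned timestamp back into a date
val viaExpr = test.withColumn("QuarterDate", to_date(expr("date_trunc('QUARTER', TestDates)")))
display(viaExpr)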