Scala/Spark: using a UDF in spark-shell to manipulate an array column of a DataFrame
```
df.printSchema
root
 |-- x: timestamp (nullable = true)
 |-- date_arr: array (nullable = true)
 |    |-- element: timestamp (containsNull = true)
```
Sample data:
|x | date_arr |
|---------------------- |---------------------------------------------------------------------- |
| 2009-10-22 19:00:00.0 | [2009-08-22 19:00:00.0, 2009-09-19 19:00:00.0, 2009-10-24 19:00:00.0] |
| 2010-10-02 19:00:00.0 | [2010-09-25 19:00:00.0, 2010-10-30 19:00:00.0] |
In udf.jar, I have this function to get the ceiling date in date_arr with respect to x:
```scala
import java.sql.Timestamp
import org.apache.hadoop.hive.ql.exec.UDF

class CeilToDate extends UDF {
  def evaluate(arr: Seq[Timestamp], x: Timestamp): Timestamp = {
    arr.filter(_.before(x)).last
  }
}
```
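As a sanity check, the filter-and-last logic itself produces the expected result outside Spark/Hive entirely. This is a standalone sketch of mine (the `CeilLogic` object and the hard-coded values are not from my jar; the values are copied from the first sample row above):

```scala
import java.sql.Timestamp

// Same logic as evaluate(): keep dates strictly before x, take the last one.
// Note this assumes date_arr is sorted ascending and at least one element
// precedes x, otherwise .last throws on an empty collection.
object CeilLogic {
  def ceilToDate(arr: Seq[Timestamp], x: Timestamp): Timestamp =
    arr.filter(_.before(x)).last
}

// First sample row: x = 2009-10-22 19:00:00
val result = CeilLogic.ceilToDate(
  Seq(
    Timestamp.valueOf("2009-08-22 19:00:00"),
    Timestamp.valueOf("2009-09-19 19:00:00"),
    Timestamp.valueOf("2009-10-24 19:00:00")
  ),
  Timestamp.valueOf("2009-10-22 19:00:00")
)
println(result) // 2009-09-19 19:00:00.0
```

So the logic matches the expected output table; the failure is in how the UDF is resolved, not in the computation.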
I add the jar to the Spark shell (`spark-shell --jars udf.jar`). In the shell I have a HiveContext, `val hc = new HiveContext(spc)`, and create the function:

```
hc.sql("create temporary function ceil_to_date as 'com.abc.udf.CeilToDate'")
```
When I run the query:

```
hc.sql("select ceil_to_date(date_arr, x) as ceildate from df").show
```

I expect a DataFrame like:
|ceildate |
|----------------------|
|2009-09-19 19:00:00.0 |
|2010-09-25 19:00:00.0 |
However, it throws this error:

```
org.apache.spark.sql.AnalysisException: No handler for Hive udf class com.abc.udf.CeilToDate because: No matching method for class com.abc.udf.CeilToDate with (array<timestamp>, timestamp). Possible choices: _FUNC_(struct<>, timestamp)
```
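For context, one workaround I have been considering (a sketch, not part of the setup above): the `struct<>` in the message suggests Hive's reflection-based UDF resolution does not recognize the Scala `Seq` parameter, whereas a UDF registered natively on the context via `udf.register` accepts Scala function types directly and needs no jar or `create temporary function` step. Assuming the same `hc` and `df` as above:

```scala
import java.sql.Timestamp

// Sketch: register the same logic as a native Spark SQL UDF instead of a
// Hive UDF; Spark maps array<timestamp> columns to Seq[Timestamp] itself.
hc.udf.register("ceil_to_date", (arr: Seq[Timestamp], x: Timestamp) =>
  arr.filter(_.before(x)).last
)

hc.sql("select ceil_to_date(date_arr, x) as ceildate from df").show
```

I would still like to know why the Hive UDF version fails with the `(array<timestamp>, timestamp)` signature.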