I am not going down the .filter road, as I cannot see the focus of the question there.
For .transform there are three cases to consider:
- dataframe transform at the DF level
- transform on an array column of a DF in v 2.4
- transform on an array column of a DF in v 3
Taking these in turn:
dataframe transform
According to the official docs, https://kb.databricks.com/data/chained-transformations.html, chaining transformations on a DF without .transform can end up like spaghetti code. Opinions can differ here.
This, they say, is messy:
...
def inc(i: Int) = i + 1

val tmp0 = func0(inc, 4)(testDf)
val tmp1 = func1(1)(tmp0)
val tmp2 = func2(2)(tmp1)
val res = tmp2.withColumn("col3", expr("col2 + 3"))
compared to:
val res = testDf.transform(func0(inc, 4))
  .transform(func1(1))
  .transform(func2(2))
  .withColumn("col3", expr("col2 + 3"))
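The helper functions are elided above ("..."); purely as a hypothetical sketch (func0, func1, func2 and testDf do not appear in the excerpt), curried definitions like these would make the snippets compile - the last parameter list takes the DataFrame, so a partially applied call has type DataFrame => DataFrame and plugs straight into .transform:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._ // already in scope in spark-shell / notebooks

// hypothetical helpers: parameters first, DataFrame last (curried)
def func0(f: Int => Int, n: Int)(df: DataFrame): DataFrame =
  df.withColumn("col0", lit(f(n)))

def func1(n: Int)(df: DataFrame): DataFrame =
  df.withColumn("col1", col("col0") + n)

def func2(n: Int)(df: DataFrame): DataFrame =
  df.withColumn("col2", col("col1") + n)

val testDf = Seq(1, 2, 3).toDF("id")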
transform with a lambda function on an array column of a DF in v 2.4, which needs the select and expr combination as there is no Scala API for it yet:
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; already in scope in spark-shell / notebooks

val df = Seq(Seq(Array(1, 999), Array(2, 9999)),
             Seq(Array(10, 888), Array(20, 8888))).toDF("c1")
val df2 = df.select(expr("transform(c1, x -> x[1])").as("last_vals"))
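A quick check of the result; each inner array is mapped to its second element (index 1):

df2.show(false)
// expected along the lines of:
// +------------+
// |last_vals   |
// +------------+
// |[999, 9999] |
// |[888, 8888] |
// +------------+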
transform with a lambda function - the new array function Scala API in v 3 - on a DF, using withColumn:
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import spark.implicits._ // for toDF; already in scope in spark-shell / notebooks

val df = Seq(
  Array("New York", "Seattle"),
  Array("Barcelona", "Bangalore")
).toDF("cities")
// the lambda takes and returns a Column directly - no expr string needed
val df2 = df.withColumn("fun_cities",
  transform(col("cities"), (city: Column) => concat(city, lit(" is fun!"))))
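Again worth a quick check; each city gets the suffix appended:

df2.select("fun_cities").show(false)
// expected along the lines of:
// +---------------------------------------+
// |fun_cities                             |
// +---------------------------------------+
// |[New York is fun!, Seattle is fun!]    |
// |[Barcelona is fun!, Bangalore is fun!] |
// +---------------------------------------+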
Try them.
A final note and an excellent point raised (from https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/):
"transform works similar to the map function in Scala. I’m not sure why they chose to name this function transform… I think array_map would have been a better name, especially because the Dataset#transform function is commonly used to chain DataFrame transformations."
Update
If you want to use the %sql or display approach for Higher Order Functions, then consult this: https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
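A minimal %sql sketch of the same higher-order transform (the inline VALUES table and the column name vals are just for illustration):

%sql
SELECT transform(vals, x -> x + 1) AS plus_one
FROM VALUES (array(1, 2, 3)) AS t(vals)
-- returns [2, 3, 4]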