30 votes

I'm writing a filter function for a complex JSON dataset with lots of inner structures. Passing individual columns is too cumbersome.

So I declared the following UDF:

val records: DataFrame = sqlContext.jsonFile("...")
def myFilterFunction(r: Row): Boolean = ???
sqlContext.udf.register("myFilter", (r: Row) => myFilterFunction(r))

Intuitively, I'm thinking it will work like this:

records.filter("myFilter(*)=true")

What is the actual syntax?

Could you specify your filter function a bit more? Using Row throws away a lot of optimizations a DataFrame does for you. – Reactormonk
The filter is pretty complex. The structure of the record is several Map fields with a bunch of key-value pairs in them. – Michael Zeltser

4 Answers

35 votes

You have to use the struct() function to construct the row when calling the function; follow these steps.

Import Row:

import org.apache.spark.sql._

Define the UDF

def myFilterFunction(r:Row) = {r.get(0)==r.get(1)} 

Register the UDF

sqlContext.udf.register("myFilterFunction", myFilterFunction _)

Create the dataFrame

val records = sqlContext.createDataFrame(Seq(("sachin", "sachin"), ("aggarwal", "aggarwal1"))).toDF("text", "text2")

Use the UDF

records.filter(callUdf("myFilterFunction",struct($"text",$"text2"))).show

When you want all columns to be passed to the UDF:

records.filter(callUdf("myFilterFunction",struct(records.columns.map(records(_)) : _*))).show 

Result:

+------+------+
|  text| text2|
+------+------+
|sachin|sachin|
+------+------+
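
A side note: callUdf is the old Spark 1.x spelling; it was later replaced by callUDF. On newer Spark versions the same call would look roughly like this (a sketch, reusing the records DataFrame and the registration from above):

import org.apache.spark.sql.functions.{callUDF, col, struct}

// Same filter, using the newer callUDF name instead of callUdf.
records.filter(callUDF("myFilterFunction", struct(col("text"), col("text2")))).show()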
3 votes

scala> inputDF
res40: org.apache.spark.sql.DataFrame = [email: string, first_name: string ... 3 more fields]

scala> inputDF.printSchema
root
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- last_name: string (nullable = true)

Now, I would like to filter the rows based on the gender field. I can accomplish that with .filter($"gender" === "Male"), but I would like to do it with .filter(function).

So, I defined my anonymous functions:

val isMaleRow = (r: Row) => { r.getAs[String]("gender") == "Male" }

val isFemaleRow = (r: Row) => { r.getAs[String]("gender") == "Female" }

inputDF.filter(isMaleRow).show()

inputDF.filter(isFemaleRow).show()

I felt the requirement could be met in a better way, i.e. without declaring a UDF and invoking it.
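
This works because in Spark 2.x a DataFrame is just a Dataset[Row], so the typed filter(func: Row => Boolean) overload is available directly. A minimal sketch, assuming the same inputDF as above:

import org.apache.spark.sql.Row

// Typed filter straight on the DataFrame; no UDF registration needed.
inputDF.filter((r: Row) => r.getAs[String]("gender") == "Male").show()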

1 vote

In addition to the first answer: when we want all columns to be passed to the UDF, we can use

 struct("*")
0 votes

If you want to take an action over the whole row and process it in a distributed way, take the row from the DataFrame, send it to a function as a struct, and then convert it to a dictionary to execute the specific action. It is very important to call collect on the final DataFrame, because Spark uses lazy evaluation and won't process the full data unless you tell it to explicitly.

In my case I needed to send each row of a DataFrame to be indexed as a dictionary object:

  1. Import the libraries.
  2. Declare the UDF; the lambda must receive the row structure.
  3. Execute the specific function, in this case sending a dictionary (the row structure converted to a dict) to be indexed.
  4. Call withColumn on the source DataFrame, which tells Spark to run the function on each row before collect is called; this is what allows the function to run in a distributed way. Don't forget to assign the result to another DataFrame variable.
  5. Call the collect method to trigger the process and distribute the function.
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType

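# 'sendToES' is the author's own indexing function (not shown); the lambda
# receives the whole row as a struct and converts it to a dict before indexing.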
myUdf = udf(lambda row: sendToES(row.asDict()), IntegerType())
dfWithControlCol = df.withColumn("control_col", myUdf(struct([df[x] for x in df.columns])))
dfWithControlCol.collect()