
I'm writing a UDF in Spark SQL and I'm wondering whether there is documentation, or a tutorial, that spells out exactly what is and isn't possible with UDFs. I'm using SQLContext, not HiveContext.

The examples I've seen typically pass in a string, transform it, and return a transformed string or some other object, which I've managed to do successfully. But what if the input is really a Spark SQL Row object, or a List of Row objects, each of which has fields with key-value pairs? In my case I'm passing in a List of Row objects by telling the UDF that the input is List[Map[String, Any]]. I think the issue is partly that it's really some kind of GenericRowWithSchema object, not a List or Array.

Also, I noticed the LATERAL VIEW with explode option. In theory I think this would work for my case, but it didn't work for me. That may be because I am not using a HiveContext, and I can't change that.

I have never used HiveContext, and I have no issues using explode to create a LATERAL VIEW. Can you post the actual code you are trying? - David Griffin
The question is what you are trying to achieve. Your question is a little too vague to understand (and answer). - ayan guha

1 Answer


As I understand the question, you first want to read a Row inside a UDF.

Define the UDF

import org.apache.spark.sql.Row
def compare(r: Row) = { r.get(0) == r.get(1) }

Register the UDF

sqlContext.udf.register("compare", compare _)

Create the dataFrame

val TestDoc = sqlContext.createDataFrame(Seq(("sachin", "sachin"), ("aggarwal", "aggarwal1"))).toDF("text", "text2")

Use the UDF

scala> import org.apache.spark.sql.functions._
scala> TestDoc.select($"text", $"text2", callUdf("compare", struct($"text", $"text2")).as("comparedOutput")).show

Result:

+--------+---------+--------------+
|    text|    text2|comparedOutput|
+--------+---------+--------------+
|  sachin|   sachin|          true|
|aggarwal|aggarwal1|         false|
+--------+---------+--------------+
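The same idea should extend to a list of Row objects: as far as I know, an array-of-struct column reaches a Scala UDF as Seq[Row], not as List[Map[String, Any]]. Here is a minimal sketch, assuming the sqlContext and TestDoc from the steps above and a Spark version that has callUDF (1.5 or later, I believe). The name countMatching is made up for illustration.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{array, callUDF, struct}

// Hypothetical helper: counts the rows whose first two fields are equal.
// The parameter is Seq[Row] because that is how an array of structs is passed in.
def countMatching(rows: Seq[Row]): Int =
  rows.count(r => r.get(0) == r.get(1))

// Assumes the existing sqlContext from the spark-shell session above
sqlContext.udf.register("countMatching", countMatching _)

// Wrap the two columns into a one-element array of structs and pass it to the UDF
TestDoc.select(
  callUDF("countMatching", array(struct(TestDoc("text"), TestDoc("text2")))).as("matches")
).show
```

Inside the UDF each element is a plain Row, so fields can be read by position with get(i), or by name with getAs if the struct carries field names.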

As for your second question, about LATERAL VIEW with explode: it is better to use HiveContext for that.
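If switching to HiveContext is not an option, one thing that may be worth trying is the explode function from the DataFrame API instead of the SQL LATERAL VIEW syntax. This is only a sketch under assumptions: the tagged DataFrame and its column names are invented for illustration, and sqlContext is the one from the spark-shell session.

```scala
import org.apache.spark.sql.functions.explode

// Hypothetical data: each row holds an id and an array of tags
val tagged = sqlContext.createDataFrame(Seq(
  (1, Seq("a", "b")),
  (2, Seq("c"))
)).toDF("id", "tags")

// explode produces one output row per array element, keeping the id alongside it
val exploded = tagged.select(tagged("id"), explode(tagged("tags")).as("tag"))
exploded.show()
```

This stays entirely in the DataFrame API, so it does not depend on which SQL dialect the context parses.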