3
votes

I have a DataFrame with two ArrayType columns, and I want to find the difference between them. column1 will always have values, while column2 may be an empty array. I created the following UDF, but it is not working.

df.show() gives the following records:

SampleData:

["Test", "Test1","Test3", "Test2"], ["Test", "Test1"]

Code:

sc.udf.register("diff", (value: Column,value1: Column)=>{ 
                        value.asInstanceOf[Seq[String]].diff(value1.asInstanceOf[Seq[String]])          
                    })  

Expected output:

["Test2","Test3"]

Spark version: 1.4.1. Any help will be appreciated.

2
What was the result? – Ram Ghadiyaram
It gives all values of value – undefined_variable
Can you paste sample data, please? Ideally it should work. – Ram Ghadiyaram
I hope you have used collection.SeqLike.diff – Ram Ghadiyaram
Please share example data and expected output. – mtoto

2 Answers

1
votes

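You need to change your UDF so that it takes the array values (Seq[String]) rather than Column objects:

import org.apache.spark.sql.functions.udf

val diff_udf = udf { (a: Seq[String], b: Seq[String]) => a diff b }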

Then this works:

import org.apache.spark.sql.functions.col
df.withColumn("diff",
  diff_udf(col("col1"), col("col2"))).show
+--------------------+-----------------+------------------+
|                col1|             col2|              diff|
+--------------------+-----------------+------------------+
|List(Test, Test1,...|List(Test, Test1)|List(Test3, Test2)|
+--------------------+-----------------+------------------+

Data

val df = sc.parallelize(Seq((List("Test", "Test1","Test3", "Test2"), 
                             List("Test", "Test1")))).toDF("col1", "col2")
2
votes

column1 will always have values while column2 may have empty array.

Your comment: "it gives all values of value" – undefined_variable

Example1 :

Let's look at a small example:

 val A = Seq(1,1)

 A: Seq[Int] = List(1, 1)

 val B = Seq.empty

 B: Seq[Nothing] = List()

 A diff B

 res0: Seq[Int] = List(1, 1)

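collection.SeqLike.diff removes from A one occurrence of each element found in B. When B is empty, nothing is removed, so you get A back unchanged, as shown in the example. In Scala this is a perfectly valid result, and it matches your case, since you said column1 always has values while column2 may be empty.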

The reverse case looks like this:

 B diff A

 res1: Seq[Nothing] = List()

If you do the same thing inside a Spark UDF, you will get the same results.
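The behaviour described above can be checked without Spark. The following is only a minimal sketch in plain Scala, meant to be pasted into the Scala REPL or spark-shell. The helper name arrayDiff is hypothetical (it is not from the question or the answers), and treating a null second array like an empty one is an assumption about the data, not something the question states.

```scala
// Hypothetical helper: the kind of function body a UDF could wrap.
// Assumption: a null second array should behave like an empty one.
def arrayDiff(a: Seq[String], b: Seq[String]): Seq[String] =
  if (b == null) a else a.diff(b)

val full  = Seq("Test", "Test1", "Test3", "Test2")
val part  = Seq("Test", "Test1")
val empty = Seq.empty[String]

// Elements of `full` not matched in `part`, in their original order.
val r1 = arrayDiff(full, part)    // List(Test3, Test2)

// Nothing to remove, so `full` comes back unchanged.
val r2 = arrayDiff(full, empty)   // List(Test, Test1, Test3, Test2)

// Arguments swapped: every element of `part` is matched, so nothing remains.
val r3 = arrayDiff(part, full)    // List()

// diff is multiset-based: only one "a" is removed per "a" in the argument.
val r4 = Seq("a", "a", "b").diff(Seq("a"))   // List(a, b)
```

If the result looks like "all values of column1" or like an empty list, it may be worth checking whether the second array is empty or whether the two arguments were passed in the opposite order.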

EDIT: (the case where the second array is not empty, as in your modified example)

Example2 :

 val p = Seq("Test", "Test1","Test3", "Test2")

 p: Seq[String] = List(Test, Test1, Test3, Test2)

 val q = Seq("Test", "Test1")

 q: Seq[String] = List(Test, Test1)

 p diff q

 res2: Seq[String] = List(Test3, Test2)

This is your expected output, and it matches the example you gave.

Reverse case: I think this is what you are actually getting, which is not what you expect.

q diff p

 res3: Seq[String] = List()