
I am trying to use a UDF and return a ListBuffer as a column from it, but I am getting an error.

I created the DataFrame by executing the code below:

val df = Seq((1,"dept3@@rama@@kumar","dept3##rama#@kumar"), (2,"dept31@@rama1##kumar1","dept33##rama3#@kumar3")).toDF("id","str1","str2")
df.show()

It shows the following:

+---+--------------------+--------------------+
| id|                str1|                str2|
+---+--------------------+--------------------+
|  1|  dept3@@rama@@kumar|  dept3##rama#@kumar|
|  2|dept31@@rama1##ku...|dept33##rama3#@ku...|
+---+--------------------+--------------------+

As per my requirement, I have to split the above columns based on some inputs, so I tried a UDF like the one below:

    def appendDelimiterError=udf((id: Int, str1: String, str2: String)=> {
            var lit = new ListBuffer[Any]()
            if(str1.contains("@@"){val a=str1.split("@@")}
            else if(str1.contains("##"){val a=str1.split("##")}
            else if(str1.contains("#&"){val a=str1.split("#&")}
            if(str2.contains("@@"){ val b=str2.split("@@")}
            else if(str2.contains("##"){ val b=str2.split("##") }
            else if(str1.contains("#@"){val b=str2.split("#@")}
            var tmp_row = List(a,"test1",b)
            lit += tmp_row
            return lit
    })

I try to call it by executing the code below:

val df1=df.appendDelimiterError("newcol",appendDelimiterError(df("id"),df("str1"),df("str2"))

I am getting the error "this was a bad call". I want to use a ListBuffer/List to store the results and return them to the caller.

My expected output is:

+---+---------------------+---------------------+---------------------------------------------------------------------------+
|id |str1                 |str2                 |newcol                                                                     |
+---+---------------------+---------------------+---------------------------------------------------------------------------+
|1  |dept3@@rama@@kumar   |dept3##rama#@kumar   |ListBuffer(List("dept3","rama","kumar"),List("dept3","rama","kumar"))      |
|2  |dept31@@rama1##kumar1|dept33##rama3#@kumar3|ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3"))|
+---+---------------------+---------------------+---------------------------------------------------------------------------+

How can I achieve this?

I think my way is easier and more flexible for splitting. - thebluephantom
Why would you need a ListBuffer? I think you are trying to build the return value with it, but there is no need. - thebluephantom
Hi @thebluephantom, thanks for the quick reply. I have to use a UDF because I have some other logic before the splitting, but your answer does not use one. Please help me use a UDF with a ListBuffer, or return a string like "111, cat, 666,@SAPRATE,222, fritz, 777". - Sai
No need to. I have no idea why a ListBuffer is needed. The logic was incorrect. My contribution ends here, functional programming. - thebluephantom
I note there is no other answer. Did you resolve this? - thebluephantom

1 Answer


An alternative without a UDF, using my own fictional data, which you can tailor to your case:

import org.apache.spark.sql.functions.{col, collect_list, split}
// In spark-shell, spark.implicits._ (needed for toDF) is already in scope.

val df = Seq(
  (1, "111@#cat@@666", "222@@fritz@@777"),
  (2, "AAA@@cat@@555", "BBB@@felix@@888"),
  (3, "HHH@@mouse@@yyy", "123##mickey#@ZZZ")
).toDF("c0", "c1", "c2")

// Regex alternation: split on any of the four delimiters.
val delims = "(@#)|(@@)|(##)|(#@)"

// One row per (id, split column): the c1 rows first, then the c2 rows.
val df2 = df.withColumn("c_split", split(col("c1"), delims))
  .union(df.withColumn("c_split", split(col("c2"), delims)))
df2.show(false)
df2.printSchema()

// Gather both split arrays for each id into a single nested array.
val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data"))
df3.show(false)
df3.printSchema()

This gives the answer as follows, without a ListBuffer (is one really necessary?):

+---+---------------+----------------+------------------+
|c0 |c1             |c2              |c_split           |
+---+---------------+----------------+------------------+
|1  |111@#cat@@666  |222@@fritz@@777 |[111, cat, 666]   |
|2  |AAA@@cat@@555  |BBB@@felix@@888 |[AAA, cat, 555]   |
|3  |HHH@@mouse@@yyy|123##mickey#@ZZZ|[HHH, mouse, yyy] |
|1  |111@#cat@@666  |222@@fritz@@777 |[222, fritz, 777] |
|2  |AAA@@cat@@555  |BBB@@felix@@888 |[BBB, felix, 888] |
|3  |HHH@@mouse@@yyy|123##mickey#@ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+

root
 |-- c0: integer (nullable = false)
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c_split: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---+---------------------------------------+
|c0 |List_of_Data                           |
+---+---------------------------------------+
|1  |[[111, cat, 666], [222, fritz, 777]]   |
|3  |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2  |[[AAA, cat, 555], [BBB, felix, 888]]   |
+---+---------------------------------------+

root
 |-- c0: integer (nullable = false)
 |-- List_of_Data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
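
If a UDF really is required, the same regex split can be written as an ordinary Scala function first. The sketch below is plain Scala with no Spark dependency; the names `SplitSketch`, `Delimiters` and `splitBoth` are hypothetical and not from the question. It returns an immutable `Seq[Seq[String]]` rather than a `ListBuffer[Any]`, on the assumption (as far as I know) that Spark cannot derive a column schema for `Any`, whereas a nested sequence of strings maps to an array of arrays like `List_of_Data` above.

```scala
object SplitSketch {
  // Same alternation as in the split(...) call of the DataFrame version:
  // break on any of the delimiters @#, @@, ## or #@.
  val Delimiters = "(@#)|(@@)|(##)|(#@)"

  // Hypothetical helper: split both input strings and return the two
  // results together as a nested, immutable sequence.
  def splitBoth(str1: String, str2: String): Seq[Seq[String]] =
    Seq(str1, str2).map(_.split(Delimiters).toList)

  def main(args: Array[String]): Unit = {
    // Inputs taken from the first row of the question's DataFrame.
    println(splitBoth("dept3@@rama@@kumar", "dept3##rama#@kumar"))
  }
}
```

Inside spark-shell, a function of this shape could then be wrapped with `udf(SplitSketch.splitBoth _)` and applied through `df.withColumn("newcol", ...)`; I have not run that against the question's data, so treat it as a starting point only.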