I am using Spark 1.5.0 and I have this issue:
val df = paired_rdd.reduceByKey {
  case (val1, val2) => val1 + "|" + val2
}.toDF("user_id", "description")
Here is sample data for df. As you can see, the description column has
this format: (text1#text3#weight | text1#text3#weight|....)
user1
book1#author1#0.07841217886795074|tool1#desc1#0.27044260397331488|song1#album1#-0.052661673730870676|item1#category1#-0.005683148395350108
I want to sort this df by weight in descending order. Here is what I tried:
First, split the contents at "|". Then, for each of the resulting strings, split at "#", take the 3rd string (the weight), and convert it to a double value:
val getSplitAtWeight = udf((str: String) => {
  str.split("|").foreach(_.split("#")(2).toDouble)
})
Sort based on the weight value returned by the udf (in descending order):
val df_sorted = df.sort(getSplitAtWeight(col("description")).desc)
I get the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
    at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:153)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:29)
    at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:64)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:29)
    at org.apache.spark.sql.functions$.udf(functions.scala:2242)
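
For reference, the parsing steps described above (split at "|", split each piece at "#", read the 3rd field as a double) can be sketched in plain Scala without Spark. This is only a minimal sketch: the object and function names are hypothetical, and it assumes the goal is to order the entries inside one description string. Two details differ from the attempt above: `String.split` takes a regular expression, so the pipe is escaped, and `map` is used instead of `foreach`, because `foreach` returns `Unit`, which is the type named in the exception.

```scala
// Hypothetical helper names; plain Scala, no Spark required.
object WeightParsing {
  // Split "a#b#w1|c#d#w2|..." into its entries.
  // String.split takes a regex, so the pipe has to be escaped.
  def entries(description: String): Seq[String] =
    description.split("\\|").map(_.trim).filter(_.nonEmpty).toSeq

  // The third "#"-separated field of one entry, as a Double.
  def weightOf(entry: String): Double =
    entry.split("#")(2).toDouble

  // map (not foreach) so that the weights are returned instead of Unit.
  def weights(description: String): Seq[Double] =
    entries(description).map(weightOf)

  // Entries re-joined with "|", ordered by weight, largest first.
  // Assumption: the ordering applies within a single description string.
  def sortByWeightDesc(description: String): String =
    entries(description).sortBy(e => -weightOf(e)).mkString("|")
}
```

Because these are ordinary functions, they can be checked on a sample string in the REPL before being wrapped in a udf.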