Spark scala data frame udf returning rows

Question

Say I have a DataFrame containing a column (called colA) that is a Seq of Row. I want to append a new field to each record of colA. (The new field depends on the existing record, so I have to write a UDF.) How should I write this UDF?

I tried writing a UDF that takes colA as input and returns a Seq[Row] in which each record contains the new field. The problem is that the UDF cannot return Seq[Row]; the exception is 'Schema for type org.apache.spark.sql.Row is not supported'. What should I do?

The UDF that I wrote:

val convert = udf[Seq[Row], Seq[Row]](blablabla...)

And the exception is:

java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported

If your final combined row is fixed, then create a case class and use that. You will have to give us more information for a detailed answer. - Ramesh Maharjan

Raphael Roth · Accepted Answer · 2018-04-08T05:13:30

Since Spark 2.0 you can create UDFs that return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an array of doubles:

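import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

// Schema of the UDF's return value (with Row elements, the element type would be a StructType).
val schema = ArrayType(DoubleType)

val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)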
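For the case in the question (appending a derived field to every struct in colA), the same idea might look roughly like the sketch below. It assumes each element of colA holds a String followed by a Double; the field names name, value and doubled, the helper appendDoubled and the object name are made up for illustration. Note that on Spark 3.x the udf(f, dataType) variant is disabled by default and needs spark.sql.legacy.allowUntypedScalaUDF set to true.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

object AppendFieldWithRowUdf {
  // Hypothetical helper: appends one field, computed from the record itself.
  def appendDoubled(rows: Seq[Row]): Seq[Row] =
    rows.map { r =>
      val doubled = r.getDouble(1) * 2 // derived from the existing record
      Row.fromSeq(r.toSeq :+ doubled)
    }

  // Return schema: the two assumed input fields plus the appended one.
  val outSchema: ArrayType = ArrayType(StructType(Seq(
    StructField("name", StringType),
    StructField("value", DoubleType),
    StructField("doubled", DoubleType)
  )))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-udf-sketch")
      // Only relevant on Spark 3.x, where udf(f, dataType) is off by default.
      .config("spark.sql.legacy.allowUntypedScalaUDF", "true")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: colA becomes array<struct<_1: string, _2: double>>.
    val df = Seq((1, Seq(("a", 1.0), ("b", 2.0)))).toDF("id", "colA")

    // The explicit schema tells Spark how to interpret the returned Rows.
    val appendField = udf((rows: Seq[Row]) => appendDoubled(rows), outSchema)

    val result = df.withColumn("colA", appendField(col("colA")))
    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```

The fields are read by position (getDouble(1)), so the input struct's own field names do not matter here; the names in outSchema are what the resulting column exposes.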

But I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDF.
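A minimal sketch of the case-class route, again assuming each element of colA holds a String followed by a Double. The names Enriched, doubled and enrich are hypothetical. Spark derives the return schema from the case class, so no explicit schema is passed:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical output record: the original fields plus the appended one.
// Defined at top level so Spark can derive a schema for it by reflection.
case class Enriched(name: String, value: Double, doubled: Double)

object AppendFieldWithCaseClassUdf {
  // Reads each input struct by position and builds the enriched record.
  def enrich(rows: Seq[Row]): Seq[Enriched] =
    rows.map { r =>
      val value = r.getDouble(1)
      Enriched(r.getString(0), value, value * 2)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("case-class-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: colA becomes array<struct<_1: string, _2: double>>.
    val df = Seq((1, Seq(("a", 1.0), ("b", 2.0)))).toDF("id", "colA")

    // The input is still Seq[Row]; only the return type is a case class.
    val appendField = udf((rows: Seq[Row]) => enrich(rows))

    val result = df.withColumn("colA", appendField(col("colA")))
    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```

The struct fields of the resulting column take their names from the case class, which is usually easier to maintain than a hand-written StructType when the output shape is fixed.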

EDIT: It could be useful if your row contains more than 22 fields (the limit for tuples, and for case classes before Scala 2.11).
