I have a UDF that processes JSON and returns a dynamic result per row. In my case I need it to validate the data and return the validated data.
The schema is flexible for each row, which means I cannot create a case class for every case (some of my data can be nested). I've tried to return a tuple from my UDF, but I had no luck there either (I needed to convert from a list to a tuple, and I didn't find an elegant way to do that).
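To illustrate, this is the kind of per-arity boilerplate the conversion seems to require (a sketch; as far as I know Scala has no generic List-to-Tuple conversion, so every arity needs its own case):

// each tuple arity needs its own hand-written conversion like this
def listToTuple3(xs: List[Any]): (Any, Any, Any) = xs match {
  case List(a, b, c) => (a, b, c)
  case other => throw new IllegalArgumentException(s"expected 3 elements, got ${other.size}")
}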
The data types that I'm returning are String, Integer, Double, and DateTime, in varying order.
I've also tried to use map on the DataFrame, but I'm having issues with my schema (see the sketch after the results below).
Here is a simplified version of my UDF attempt:

import org.apache.spark.sql.functions.udf
import spark.implicits._

def processData(row_type: String) = {
  // pseudo-code: completely random output here -- a Tuple/List/Array of
  // elements with types Integer, String, Double, DateType
  if (row_type == "A")
    (1, "second", 3)
  else
    (1, "second", 3, 4) // note the different arity per branch
}
val processDataUDF = udf((row_type: String) => processData(row_type))
val df = Seq((0, 1), (1, 2)).toDF("a", "b")
val df2 = df.select(processDataUDF($"a"))
df2.show(5)
df2.printSchema()
Results
+------------+
| UDF(a)|
+------------+
|[1,second,3]|
|[1,second,3]|
+------------+
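For reference, the map attempt looked roughly like this (a minimal sketch reusing the df from above, assuming Spark 2.x; the output schema and column names are placeholders I made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

// the explicit encoder pins one schema for the entire Dataset,
// which is exactly where my schema trouble starts: each row
// may need a different shape
val outSchema = StructType(Seq(
  StructField("c1", IntegerType),
  StructField("c2", StringType),
  StructField("c3", IntegerType)))
val mapped = df.map(_ => Row(1, "second", 3))(RowEncoder(outSchema))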
How should I approach this problem? The processing results differ per row_type, and all the row_types are set dynamically. I can create a schema for each row_type (see the sketch at the end), but I cannot make the same UDF return results with different schemas. Is using map the only approach here?
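By "a schema for each row_type" I mean something like this (a sketch; the field names are placeholders):

import org.apache.spark.sql.types._

// hypothetical schemas for two row types; schemaB has one extra
// field, mirroring the (1, "second", 3) vs (1, "second", 3, 4) outputs
val schemaA = StructType(Seq(
  StructField("c1", IntegerType),
  StructField("c2", StringType),
  StructField("c3", IntegerType)))
val schemaB = schemaA.add(StructField("c4", IntegerType))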