I'm trying to modify a DataFrame generated by an external library. I receive a DataFrame with this schema:
root
 |-- child: struct (nullable = true)
 |    |-- child_id: long (nullable = true)
I would like to wrap the child struct above in an array, as shown below:
root
 |-- child: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child_id: long (nullable = true)
I tried to define a UDF:
// The two lines below are just an example; in reality I get the DataFrame from an external library.
val seq = sc.parallelize(Seq("""{ "child": { "child_id": 1}}"""))
val df = sqlContext.read.json(seq)
val myUDF = udf((x: Row) => Array(x))
val df2 = df.withColumn("children", myUDF($"child"))
But I get an exception: "Schema for type org.apache.spark.sql.Row is not supported"
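My understanding is that Spark cannot derive a schema for Row as a UDF return type, which is why the UDF fails. One alternative I'm considering is the built-in array function from org.apache.spark.sql.functions, which should wrap a column (including a struct) into a one-element array without enumerating its fields, but I haven't verified it against the real DataFrame (a sketch, assuming the same df as above):

```scala
import org.apache.spark.sql.functions.array

// Wrap the struct column into a single-element array column,
// keeping the struct's fields intact and unnamed.
val df2 = df.withColumn("child", array($"child"))
df2.printSchema()
```

If array preserves nested schemas, this would avoid both the UDF and any explicit field listing.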
I'm working with Spark 2.1.1.
The real DataFrame is very complex. Is there a solution that allows modifying the schema without listing the names or positions of the fields in the child struct? For the same reason I would also rather not map to explicit case classes.
Thank you in advance for any help!