I'm trying to modify a DataFrame generated by an external library. I receive a DataFrame with this schema:

root
 |-- child: struct (nullable = true)
 |    |-- child_id: long (nullable = true)

I would like to wrap the child struct above in an array, as shown in the box below.

root
 |-- child: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child_id: long (nullable = true)

I tried to define a UDF:

// The two lines below are an example; in reality I get the DataFrame from an external library.
val seq = sc.parallelize(Seq("""{ "child": { "child_id": 1}}"""))
val df = sqlContext.read.json(seq)

val myUDF = udf((x: Row) => Array(x))
val df2 = df.withColumn("children", myUDF($"child"))

But I get an exception: "Schema for type org.apache.spark.sql.Row is not supported"

I'm working with Spark 2.1.1.

The real DataFrame is very complex. Is there a solution that allows modifying the schema without listing the names or positions of the fields in the child struct? For the same reason, I would also rather not map to explicit case classes.

Thank you in advance for any help!

1 Answer

4 votes

You can use the built-in array function to get your desired result:

import org.apache.spark.sql.functions._
val df2 = df.withColumn("child", array("child"))

This will overwrite the existing column. If you want the result in a separate column instead, do:

import org.apache.spark.sql.functions._
val df2 = df.withColumn("children", array("child"))
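Putting it together with the question's sample data, a minimal sketch (assuming a Spark 2.x SparkSession named spark; the sc/sqlContext entry points from the question work the same way in 2.1.1):

```scala
import org.apache.spark.sql.functions.array

// Build the same sample DataFrame as in the question.
val seq = Seq("""{ "child": { "child_id": 1}}""")
val df = spark.read.json(spark.sparkContext.parallelize(seq))

// Wrap the struct column in a single-element array without
// naming any of the struct's fields.
val df2 = df.withColumn("children", array("child"))
df2.printSchema()
```

printSchema should show children as an array column whose element type is the original child struct, which is the schema asked for. Because array operates on the column as a whole, this works no matter how many nested fields the struct contains.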