Trying to run:
val outputDF = hiveContext.createDataFrame(myRDD, schema)
Getting this error:
Caused by: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of struct<col1name:string,col2name:string>
myRDD.take(5).foreach(println)
[string number,[Lscala.Tuple2;@163601a5]
[1234567890,[Lscala.Tuple2;@6fa7a81c]
Structure of the RDD:
RDD[Row]: [string number, [(string key, string value)]]
i.e. each element is Row(String, Array[(String, String)])
where each Tuple2 contains data like this:
(string key, string value)
schema:
root
|-- col1name: string (nullable = true)
|-- col2name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col3name: string (nullable = true)
| | |-- col4name: string (nullable = true)
StructType(Seq(
  StructField("col1name", StringType, true),
  StructField("col2name", ArrayType(
    StructType(Seq(
      StructField("col3name", StringType, true),
      StructField("col4name", StringType, true)
    )),
    true
  ), true)
))
This code used to run on Spark 1.6 without problems. In Spark 2.4, it appears that a Tuple2 no longer counts as a struct type? If so, what should it be changed to?
I'm assuming the easiest solution would be to adjust the schema to suit the data.
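For reference, here is a minimal, self-contained sketch of the conversion I would expect to work, assuming the cause is that Spark 2.x requires each struct element inside a Row to be an org.apache.spark.sql.Row rather than a scala.Tuple2. The toy data and local SparkSession are stand-ins for my real myRDD and hiveContext:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: in Spark 2.x the external type for a struct field must be a
// Row, not a scala.Tuple2, so each (key, value) tuple has to be wrapped in
// a Row before createDataFrame is called.
val spark = SparkSession.builder().master("local[1]").getOrCreate()

val schema = StructType(Seq(
  StructField("col1name", StringType, true),
  StructField("col2name", ArrayType(
    StructType(Seq(
      StructField("col3name", StringType, true),
      StructField("col4name", StringType, true)
    )),
    true
  ), true)
))

// Toy stand-in for myRDD: Row(String, Array[(String, String)])
val myRDD = spark.sparkContext.parallelize(Seq(
  Row("1234567890", Array(("string key", "string value")))
))

// Wrap each Tuple2 in a Row so it matches the struct element in the schema.
val fixedRDD = myRDD.map { row =>
  val tuples = row.getAs[Array[(String, String)]](1)
  Row(row.get(0), tuples.map { case (k, v) => Row(k, v) })
}

val outputDF = spark.createDataFrame(fixedRDD, schema)
outputDF.show(false)
```

If this is the right direction, the schema would stay as-is and only the data would change.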
Let me know if more details are needed