
Trying to run: val outputDF = hiveContext.createDataFrame(myRDD, schema)

Getting this error: Caused by: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of struct<col1name:string,col2name:string>

myRDD.take(5).foreach(println)

[string number,[Lscala.Tuple2;@163601a5]
[1234567890,[Lscala.Tuple2;@6fa7a81c]

data of the RDD:

RDD[Row]: [string number, [(string key, string value)]]
Row(string, Array(Tuple(String, String)))

where the tuple2 contains data like this:

(string key, string value)

schema:
root
 |-- col1name: string (nullable = true)
 |-- col2name: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col3name: string (nullable = true)
 |    |    |-- col4name: string (nullable = true)

StructType(
    StructField(col1name,StringType,true), 
    StructField(col2name,ArrayType(
        StructType(
            StructField(col3name,StringType,true), 
            StructField(col4name,StringType,true)
            ),
        true
        ),
    true
    )
)
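For reference, the printed `StructType(...)` form above is the schema's toString and won't compile as-is (the `StructType` constructor takes a `Seq` of fields). A minimal sketch of building the same schema programmatically, with the field names taken from the printed tree:

```scala
import org.apache.spark.sql.types._

object SchemaSketch {
  val schema: StructType = StructType(Seq(
    StructField("col1name", StringType, nullable = true),
    StructField("col2name",
      ArrayType(
        StructType(Seq(
          StructField("col3name", StringType, nullable = true),
          StructField("col4name", StringType, nullable = true)
        )),
        containsNull = true),
      nullable = true)
  ))

  def main(args: Array[String]): Unit =
    // Prints the same tree shown in the question
    println(schema.treeString)
}
```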

This code used to run on Spark 1.6 without problems. In Spark 2.4, it appears that a Tuple2 no longer counts as a struct type? In that case, what should it be changed to?

I'm assuming the easiest solution would be to adjust the schema to suit the data.

Let me know if more details are needed


1 Answer


The fix is to replace the Tuple2 holding the two strings with a Row holding the same two strings.

So for the provided schema, the incoming data structure was

Row(string, Array(Tuple(String, String)))

This was changed to

Row(string, Array(Row(String, String)))

in order to continue using the same schema.
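A minimal end-to-end sketch of that change, assuming a local SparkSession and made-up sample data (the column names come from the schema in the question):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object TupleToRow {
  // Wrap each (key, value) tuple in a Row so the array element
  // matches struct<col3name:string,col4name:string> in the schema.
  def convert(record: (String, Array[(String, String)])): Row =
    Row(record._1, record._2.map { case (k, v) => Row(k, v) })

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("tuple-to-row")
      .getOrCreate()

    val schema = StructType(Seq(
      StructField("col1name", StringType, nullable = true),
      StructField("col2name", ArrayType(
        StructType(Seq(
          StructField("col3name", StringType, nullable = true),
          StructField("col4name", StringType, nullable = true)
        )),
        containsNull = true), nullable = true)
    ))

    // Made-up sample data in the original shape: (String, Array[(String, String)])
    val raw = Seq(("string number", Array(("key1", "value1"), ("key2", "value2"))))

    // Map each tuple element to a Row before calling createDataFrame;
    // passing the tuples directly is what raised the
    // "scala.Tuple2 is not a valid external type" error in Spark 2.4.
    val rowRDD = spark.sparkContext.parallelize(raw).map(convert)

    val outputDF = spark.createDataFrame(rowRDD, schema)
    outputDF.show(false)
    spark.stop()
  }
}
```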