I am trying to create a StructType schema to pass to the from_json API in order to parse a column stored as a JSON string. The JSON data contains a Map that has String keys and values of type struct, but the schema of each struct depends on the key.
Consider this JSON example, where the "data" column is a Map with keys "name" and "address", and the schema of each value differs:
{
  "data": {
    "name": {
      "first": "john"
    },
    "address": {
      "street": "wall",
      "zip": 10000
    }
  }
}
For key "name", the struct value has a single member field "first". For key "address", the struct value has two member fields "street" and "zip".
Can the "data" column be represented as MapType[StringType, StructType] in a Spark DataFrame?
- Does Spark handle a Map[String, Struct] where the structs are non-homogeneous?
- If yes, please share an example of a StructType schema representing a DataFrame with schema MapType<String, StructType> where the StructType is non-homogeneous.
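For context, here is the kind of schema I would write if the values were homogeneous. As far as I can tell, MapType takes exactly one valueType, so the only direct option I see is a hypothetical "union" struct that merges every field from every possible value (leaving inapplicable fields null), which is not what I want:

```scala
import org.apache.spark.sql.types._

// Sketch only: MapType accepts a single valueType, so this forces every map
// value into one merged struct ("first" from Name, "street"/"zip" from Address).
// Fields that don't apply to a given key would come back null.
val homogeneousSchema = StructType(Seq(
  StructField("data", MapType(
    StringType,
    StructType(Seq(
      StructField("first", StringType),
      StructField("street", StringType),
      StructField("zip", IntegerType)
    ))
  ))
))
```

This parses, but it loses the fact that "name" and "address" have distinct shapes, which is exactly the information I'm trying to preserve.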
EDIT: Here is another example of such data: a Map[String, Struct] where the structs do not share a single schema across the map's values:
case class Address(street: String, zip: Int)
case class Name(first: String)
case class Employee(id: String, role: String)

val map = Map(
  "address" -> Address("wall", 10000),
  "name" -> Name("john"),
  "occupation" -> Employee("12345", "Software Engineer")
)
As you can see, the values of the map differ in schema: Address, Name, and Employee are all different case classes, and their member fields also differ.
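Even plain Scala illustrates the problem (repeating the definitions above so the snippet is self-contained): the compiler infers the least upper bound of the three case classes as the map's value type, so the static type carries no per-key schema and recovering a concrete shape requires a runtime match on each value:

```scala
case class Address(street: String, zip: Int)
case class Name(first: String)
case class Employee(id: String, role: String)

val map = Map(
  "address" -> Address("wall", 10000),
  "name" -> Name("john"),
  "occupation" -> Employee("12345", "Software Engineer")
)

// The inferred value type is the least upper bound of the three case classes
// (Product with Serializable), so getting at a concrete field means pattern
// matching each value at runtime.
val zips = map.values.collect { case Address(_, zip) => zip }.toList
```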
You can think of this kind of data as coming from a JSON file where a map may hold an arbitrary value type per key, with no restriction that all values share one type. In my case, every value will be a struct, but the schema of each struct depends on the map key.
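One fallback I'm considering (a sketch only, and I haven't verified it across Spark versions): declare the map's values as StringType so each struct survives the first parse as a raw JSON string, then apply from_json a second time with a key-specific schema. The perKeySchemas name below is hypothetical, one entry per expected key:

```scala
import org.apache.spark.sql.types._

// First pass: parse "data" with string values, deferring the struct schemas.
// (I believe from_json keeps a nested object as its raw JSON text when the
// schema declares StringType, but I haven't confirmed this behavior.)
val rawSchema = StructType(Seq(
  StructField("data", MapType(StringType, StringType))
))

// Second pass (hypothetical): a lookup of key-specific schemas to feed into
// from_json on each extracted value.
val perKeySchemas: Map[String, StructType] = Map(
  "name" -> StructType(Seq(StructField("first", StringType))),
  "address" -> StructType(Seq(
    StructField("street", StringType),
    StructField("zip", IntegerType)
  ))
)
```

This sidesteps the non-homogeneous MapType question rather than answering it, which is why I'd still like to know whether a direct representation exists.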