I am trying to create a StructType schema to pass to the from_json API in order to parse a column stored as a JSON string. The JSON data contains a Map that has String keys and values of type struct, but the schema of each struct depends on the key.
Consider this JSON example, where the "data" column is a Map with keys "name" and "address", and the schema of each value differs:
{
  "data": {
    "name": {
      "first": "john"
    },
    "address": {
      "street": "wall",
      "zip": 10000
    }
  }
}
For key "name", the struct value has a single member field "first". For key "address", the struct value has two member fields "street" and "zip".
Can the "data" column be represented as MapType[StringType, StructType] in a Spark DataFrame?
- Does Spark handle a Map[String, Struct] where the structs are non-homogeneous?
- If yes, please share an example of a StructType schema representing a DataFrame with schema MapType<String, StructType> where the StructType is non-homogeneous.
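For context, here is the kind of schema I would write if the values were homogeneous. As far as I can tell, MapType takes exactly one valueType, so the only direct option I see is a hypothetical "union" struct that merges every field from every possible value (leaving inapplicable fields null), which is not what I want:

```scala
import org.apache.spark.sql.types._

// Sketch only: MapType accepts a single valueType, so this forces every map
// value into one merged struct ("first" from Name, "street"/"zip" from Address).
// Fields that don't apply to a given key would come back null.
val homogeneousSchema = StructType(Seq(
  StructField("data", MapType(
    StringType,
    StructType(Seq(
      StructField("first", StringType),
      StructField("street", StringType),
      StructField("zip", IntegerType)
    ))
  ))
))
```

This parses, but it loses the fact that "name" and "address" have distinct shapes, which is exactly the information I'm trying to preserve.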
EDIT: Here is another example of such data: a Map[String, Struct] where the structs do not share a single schema across the map's values:
case class Address(street: String, zip: Int)
case class Name(first: String)
case class Employee(id: String, role: String)

val map = Map(
  "address" -> Address("wall", 10000),
  "name" -> Name("john"),
  "occupation" -> Employee("12345", "Software Engineer")
)
As you can see, the values of the map differ in schema: Address, Name, and Employee are all different case classes, and their member fields also differ.
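Even plain Scala illustrates the problem (repeating the definitions above so the snippet is self-contained): the compiler infers the least upper bound of the three case classes as the map's value type, so the static type carries no per-key schema and recovering a concrete shape requires a runtime match on each value:

```scala
case class Address(street: String, zip: Int)
case class Name(first: String)
case class Employee(id: String, role: String)

val map = Map(
  "address" -> Address("wall", 10000),
  "name" -> Name("john"),
  "occupation" -> Employee("12345", "Software Engineer")
)

// The inferred value type is the least upper bound of the three case classes
// (Product with Serializable), so getting at a concrete field means pattern
// matching each value at runtime.
val zips = map.values.collect { case Address(_, zip) => zip }.toList
```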
You can think of this kind of data as coming from a JSON file where a map may hold an arbitrary value type per key, with no restriction that all values share one type. In my case, every value will be a struct, but the schema of each struct depends on the map key.
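One fallback I'm considering (a sketch only, and I haven't verified it across Spark versions): declare the map's values as StringType so each struct survives the first parse as a raw JSON string, then apply from_json a second time with a key-specific schema. The perKeySchemas name below is hypothetical, one entry per expected key:

```scala
import org.apache.spark.sql.types._

// First pass: parse "data" with string values, deferring the struct schemas.
// (I believe from_json keeps a nested object as its raw JSON text when the
// schema declares StringType, but I haven't confirmed this behavior.)
val rawSchema = StructType(Seq(
  StructField("data", MapType(StringType, StringType))
))

// Second pass (hypothetical): a lookup of key-specific schemas to feed into
// from_json on each extracted value.
val perKeySchemas: Map[String, StructType] = Map(
  "name" -> StructType(Seq(StructField("first", StringType))),
  "address" -> StructType(Seq(
    StructField("street", StringType),
    StructField("zip", IntegerType)
  ))
)
```

This sidesteps the non-homogeneous MapType question rather than answering it, which is why I'd still like to know whether a direct representation exists.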