How to get Schema as a Spark Dataframe from a Nested Structured Spark DataFrame

Question

I have a sample DataFrame that I create using the code below:

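import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types._

val data = Seq(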
  Row(20.0, "dog"),
  Row(3.5, "cat"),
  Row(0.000006, "ant")
)

val schema = StructType(
  List(
    StructField("weight", DoubleType, true),
    StructField("animal_type", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

val actualDF = df.withColumn(
  "animal_interpretation",
  struct(
    (col("weight") > 5).as("is_large_animal"),
    col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
  )
)

actualDF.show(false)

+------+-----------+---------------------+
|weight|animal_type|animal_interpretation|
+------+-----------+---------------------+
|20.0  |dog        |[true,true]          |
|3.5   |cat        |[false,true]         |
|6.0E-6|ant        |[false,false]        |
+------+-----------+---------------------+

The schema of this DataFrame can be printed using printSchema:

scala> actualDF.printSchema
root
 |-- weight: double (nullable = true)
 |-- animal_type: string (nullable = true)
 |-- animal_interpretation: struct (nullable = false)
 |    |-- is_large_animal: boolean (nullable = true)
 |    |-- is_mammal: boolean (nullable = true)

However, I would like to get this schema as a DataFrame with three columns: field, type, and nullable. The output DataFrame built from the schema would look something like this:

+-------------------------------------+--------------+--------+
|field                                |type          |nullable|
+-------------------------------------+--------------+--------+
|weight                               |double        |true    |
|animal_type                          |string        |true    |
|animal_interpretation                |struct        |false   |
|animal_interpretation.is_large_animal|boolean       |true    |
|animal_interpretation.is_mammal      |boolean       |true    |
+-------------------------------------+--------------+--------+

How can I achieve this in Spark? I am using Scala.

koiralo koiralo · Accepted Answer · 2019-07-16T20:12:46

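You could do something like this:

import org.apache.spark.sql.types.StructType
import spark.implicits._ // needed for toDF on a local Seq; already in scope in spark-shell

// Recursively walk the schema and return (field path, type, nullable) for each leaf field.
def flattenSchema(schema: StructType, prefix: Option[String] = None): Seq[(String, String, Boolean)] =
  schema.fields.toSeq.flatMap { field =>
    // Build the dotted path, e.g. animal_interpretation.is_mammal
    val path = prefix.fold(field.name)(p => s"$p.${field.name}")
    field.dataType match {
      case st: StructType => flattenSchema(st, Some(path)) // descend into nested structs
      case other          => Seq((path, other.simpleString, field.nullable))
    }
  }

// show(false) avoids truncating the long dotted field names
flattenSchema(actualDF.schema).toDF("field", "type", "nullable").show(false)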
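
Note that flattenSchema returns leaf fields only, so the animal_interpretation struct row from the desired output is not produced. If that row is wanted, one possible variant is sketched below. The name flattenSchemaWithStructs and the hand-built schema nested are made up for illustration (nested is meant to mirror actualDF.schema), and like the function above, this sketch does not descend into arrays or maps.

```scala
import org.apache.spark.sql.types._

// Hypothetical variant of flattenSchema: it also emits one row for each struct column itself.
def flattenSchemaWithStructs(
    schema: StructType,
    prefix: Option[String] = None
): Seq[(String, String, Boolean)] =
  schema.fields.toSeq.flatMap { field =>
    val path = prefix.fold(field.name)(p => s"$p.${field.name}")
    field.dataType match {
      case st: StructType =>
        // typeName is just "struct"; simpleString would expand to struct<...>
        (path, st.typeName, field.nullable) +: flattenSchemaWithStructs(st, Some(path))
      case other =>
        Seq((path, other.simpleString, field.nullable))
    }
  }

// Assumed stand-in for actualDF.schema, built by hand so no SparkSession is needed here.
val nested = StructType(Seq(
  StructField("weight", DoubleType, nullable = true),
  StructField("animal_type", StringType, nullable = true),
  StructField("animal_interpretation", StructType(Seq(
    StructField("is_large_animal", BooleanType, nullable = true),
    StructField("is_mammal", BooleanType, nullable = true)
  )), nullable = false)
))

val rows = flattenSchemaWithStructs(nested)
rows.foreach(println)
```

With spark.implicits._ in scope, flattenSchemaWithStructs(actualDF.schema).toDF("field", "type", "nullable") should give the five-row DataFrame asked for in the question.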

Hope this helps!
