0
votes

I have dataframe with below schema. I want all the columns including the nested fields should be sorted alphabetically. I want it in scala spark.

root
 |-- metadata2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute2: string (nullable = true)
 |    |    |-- attribute1: string (nullable = true)
 |-- metadata3: string (nullable = true)
 |-- metadata1: struct (containsNull = true)
 |    |-- attribute2: string (nullable = true)
 |    |-- attribute1: string (nullable = true)

when I sort using schema.sortBy(_.name), I get below schema(the nested array and struct type fields are not sorted)

root
 |-- metadata1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute2: string (nullable = true)
 |    |    |-- attribute1: string (nullable = true)
 |-- metadata2: struct (containsNull = true)
 |    |-- attribute2: string (nullable = true)
 |    |-- attribute1: string (nullable = true)
 |-- metadata3: string (nullable = true)

The schema which I want is as below. (Even the columns inside the metadata1(ArrayType) and metadata2(StructType) should be sorted)

root
 |-- metadata1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute1: string (nullable = true)
 |    |    |-- attribute2: string (nullable = true)
 |-- metadata2: struct (containsNull = true)
 |    |-- attribute1: string (nullable = true)
 |    |-- attribute2: string (nullable = true)
 |-- metadata3: string (nullable = true)

Thanks in advance.

1

1 Answers

0
votes

Version to StructType:

import spark.implicits._
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("metadata2",       StructType(
    Seq(StructField("attribute2", StringType),
      StructField("attribute1", StringType)))),
  StructField("metadata3", StringType),
  StructField("metadata1", ArrayType(StringType)
  )
))

schema.foreach(println _)
//  StructField(metadata2,StructType(StructField(attribute2,StringType,true), StructField(attribute1,StringType,true)),true)
//  StructField(metadata3,StringType,true)
//  StructField(metadata1,ArrayType(StringType,true),true)


val schemaResult = schema.sortBy(_.name).map{c =>
  c.dataType match {
    case structType: StructType => StructField(c.name, StructType(structType.fields.sortBy(_.name)))
    case _ => c
  }
}

schemaResult.foreach(println _)
//  StructField(metadata1,ArrayType(StringType,true),true)
//  StructField(metadata2,StructType(StructField(attribute1,StringType,true), StructField(attribute2,StringType,true)),true)
//  StructField(metadata3,StringType,true)
println(schemaResult)
//  List(StructField(metadata1,ArrayType(StringType,true),true), StructField(metadata2,StructType(StructField(attribute1,StringType,true), StructField(attribute2,StringType,true)),true), StructField(metadata3,StringType,true))