I am trying to create a StructType schema from an existing schema. I have a list of the field names required in the new schema. The tough part is that the schema describes nested JSON data with complex fields, including ArrayType(StructType). Here is the code for the schema:
import org.apache.spark.sql.types._

val schema1: Seq[StructField] = Seq(
  StructField("playerId", StringType, true),
  StructField("playerName", StringType, true),
  StructField("playerCountry", StringType, true),
  StructField("playerBloodType", StringType, true)
)
val schema2: Seq[StructField] =
  Seq(
    StructField("PlayerHistory", ArrayType(
      StructType(
        Seq(
          StructField("Rating", StringType, true),
          StructField("Height", StringType, true),
          StructField("Weight", StringType, true),
          StructField("CoachDetails",
            StructType(
              Seq(
                StructField("CoachName", StringType, true),
                StructField("Address",
                  StructType(
                    Seq(
                      StructField("AddressLine1", StringType, true),
                      StructField("AddressLine2", StringType, true),
                      StructField("CoachCity", StringType, true))), true),
                StructField("Suffix", StringType, true))), true),
          StructField("GoalHistory", ArrayType(
            StructType(
              Seq(
                StructField("MatchDate", StringType, true),
                StructField("NumberofGoals", StringType, true),
                StructField("SubstitutionIndicator", StringType, true))), true), true),
          StructField("receive_date", DateType, true))
      ), true
    )))
val requiredFields = List("playerId", "playerName", "Rating", "CoachName", "CoachCity", "MatchDate", "NumberofGoals")
val schema: StructType = StructType(schema1 ++ schema2)
The variable schema is the current schema, and requiredFields holds the fields we require in the new schema. We also need the parent blocks of those fields in the new schema. The output schema should look something like this:
val outputSchema: Seq[StructField] =
  Seq(
    StructField("playerId", StringType, true),
    StructField("playerName", StringType, true),
    StructField("PlayerHistory",
      ArrayType(StructType(Seq(
        StructField("Rating", StringType, true),
        StructField("CoachDetails",
          StructType(Seq(
            StructField("CoachName", StringType, true),
            StructField("Address", StructType(Seq(
              StructField("CoachCity", StringType, true))), true))), true),
        StructField("GoalHistory", ArrayType(
          StructType(Seq(
            StructField("MatchDate", StringType, true),
            StructField("NumberofGoals", StringType, true))), true), true))), true), true))
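To check a candidate result against the target above, it may help to flatten a schema into dotted leaf paths and compare the lists. The helper below is only a sketch: `leafPaths` is a hypothetical name, not part of Spark, and it assumes arrays should be looked through without adding a path segment.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not a Spark API): lists every leaf field as a dotted path.
// Arrays are traversed transparently, so an array of structs contributes
// the paths of its element struct under the array field's name.
def leafPaths(dt: DataType, prefix: String = ""): Seq[String] = dt match {
  case st: StructType =>
    st.fields.toSeq.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      leafPaths(f.dataType, path)
    }
  case ArrayType(elementType, _) => leafPaths(elementType, prefix)
  case _ => Seq(prefix)
}

// Small illustrative schema, made up for this example.
val tiny = StructType(Seq(
  StructField("a", StringType, true),
  StructField("b", ArrayType(StructType(Seq(
    StructField("c", StringType, true),
    StructField("d", StructType(Seq(
      StructField("e", StringType, true))), true))), true), true)))

leafPaths(tiny) // Seq("a", "b.c", "b.d.e")
```

Applied to the desired output, the leaf names should be exactly the entries of requiredFields, each prefixed by its parent blocks.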
I have tried approaching the problem in a recursive manner with the following piece of code.
schema.fields.map(f => filterSchema(f, requiredFields)).filter(_.name != "")

def filterSchema(field: StructField, requiredColumns: Seq[String]): StructField = {
  field match {
    case StructField(_, inner: StructType, _, _) =>
      StructField(field.name, StructType(inner.fields.map(f => filterSchema(f, requiredColumns))))
    case StructField(_, ArrayType(structType: StructType, _), _, _) =>
      if (requiredColumns.contains(field.name))
        StructField(field.name, ArrayType(StructType(structType.fields.map(f => filterSchema(f, requiredColumns))), true), true)
      else
        StructField("", StringType, true)
    case StructField(_, _, _, _) =>
      if (requiredColumns.contains(field.name)) field else StructField("", StringType, true)
  }
}
However, I am having trouble filtering out the inner StructFields.
I feel the base condition of the recursive function needs to be modified. Any help here would be highly appreciated. Thanks in advance.
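One direction that might work, sketched under assumptions: in the attempt above, the `.filter(_.name != "")` runs only at the top level, and the ArrayType case checks the array's own name, so PlayerHistory (which is not in requiredFields) would be dropped together with everything inside it. Returning an Option from the recursion instead of a placeholder field lets each level drop what it does not need and keep a parent only when something inside it survives. The names `pruneField` and `pruneSchema` are hypothetical, and the sketch assumes requiredFields lists leaf names only and that matching by bare name is acceptable (two leaves with the same name under different parents would both be kept).

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: Some(pruned field) if the field, or anything nested
// inside it, is required; None if nothing under it is required.
def pruneField(field: StructField, required: Set[String]): Option[StructField] =
  field.dataType match {
    case st: StructType =>
      // Keep the struct only if at least one child survives.
      val kept = st.fields.toSeq.flatMap(f => pruneField(f, required))
      if (kept.nonEmpty) Some(field.copy(dataType = StructType(kept))) else None
    case ArrayType(st: StructType, containsNull) =>
      // Same rule for arrays of structs; the array's own name is not checked.
      val kept = st.fields.toSeq.flatMap(f => pruneField(f, required))
      if (kept.nonEmpty) Some(field.copy(dataType = ArrayType(StructType(kept), containsNull)))
      else None
    case _ =>
      // Leaf field: keep it only if its name is requested.
      if (required.contains(field.name)) Some(field) else None
  }

// Hypothetical entry point: prunes every top-level field of the schema.
def pruneSchema(schema: StructType, required: Seq[String]): StructType = {
  val wanted = required.toSet
  StructType(schema.fields.toSeq.flatMap(f => pruneField(f, wanted)))
}

// Small illustrative schema, made up for this example.
val sample = StructType(Seq(
  StructField("playerId", StringType, true),
  StructField("playerCountry", StringType, true),
  StructField("PlayerHistory", ArrayType(StructType(Seq(
    StructField("Rating", StringType, true),
    StructField("Height", StringType, true),
    StructField("CoachDetails", StructType(Seq(
      StructField("CoachName", StringType, true),
      StructField("Suffix", StringType, true))), true))), true), true)))

val pruned = pruneSchema(sample, Seq("playerId", "Rating", "CoachName"))
pruned.printTreeString()
```

With the schema and requiredFields from the question, the call would be pruneSchema(schema, requiredFields). Using `field.copy` keeps each field's original nullability and metadata, which the three-argument StructField constructor calls in the attempt would not preserve.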