I have data with the following schema:

DummyData
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b1: string (nullable = true)
 |    |    |-- b2: string (nullable = true)
 |-- c: long (nullable = true)

with the case classes defined as:

case class DummyData(a: String, b: List[DummyDataChild], c: Long)
case class DummyDataChild(b1: String, b2: String)

When I read this data into a DataFrame, the nested elements come back as GenericRowWithSchema instead of the expected case class (DummyDataChild in this scenario). Is there any way to read the nested child objects as case classes in Scala Spark?

P.S.: I know the fields can be extracted from the GenericRowWithSchema as a workaround.
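
For context, this is roughly what that workaround looks like, as a minimal sketch. It assumes the DataFrame is held in a variable named df with the schema above, collects to the driver, and skips null handling; all names here are illustrative.

import org.apache.spark.sql.Row

// Pull the nested fields out of the untyped Rows by hand.
val parsed: Array[DummyData] = df.collect().map { row =>
  val children = row.getAs[Seq[Row]]("b")
    .map(child => DummyDataChild(child.getAs[String]("b1"), child.getAs[String]("b2")))
    .toList
  DummyData(row.getAs[String]("a"), children, row.getAs[Long]("c"))
}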

1 Answer


You should be able to use the built-in encoders for case classes:

import spark.implicits._  // provides the Encoder[DummyData] needed by .as (if not already imported)

val ds = df.as[DummyData]
df.printSchema
ds.printSchema

Output:

ds: org.apache.spark.sql.Dataset[DummyData] = [a: string, b: array<struct<b1:string,b2:string>> ... 1 more field]

root
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b1: string (nullable = true)
 |    |    |-- b2: string (nullable = true)
 |-- c: long (nullable = false)

root
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b1: string (nullable = true)
 |    |    |-- b2: string (nullable = true)
 |-- c: long (nullable = false)
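
As a hedged follow-up sketch (assuming the ds value above): once the Dataset is typed, the elements of b are real DummyDataChild instances rather than GenericRowWithSchema, so their fields can be accessed directly in typed operations.

// Collect the b1 field of every child across all rows, without touching Row.
val b1Values = ds.flatMap(d => d.b.map(child => child.b1))  // Dataset[String]
b1Values.show()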