5
votes

I have the following case class:

case class Person(name: String, lastname: Option[String] = None, age: BigInt) {}

And the following json:

{ "name": "bemjamin", "age" : 1 }

When I try to transform my dataframe into a dataset:

spark.read.json("example.json")
  .as[Person].show()

It shows me the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'lastname' given input columns: [age, name];

My question is: if my schema is my case class and it defines that lastname is optional, shouldn't as() do the conversion?

I can easily fix this using a .map, but I would like to know if there is a cleaner alternative.
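
Roughly, the .map workaround I have in mind looks like this (just a sketch; reading age back as a Long and the defensive lastname lookup are my own assumptions):

import org.apache.spark.sql.Row
import spark.implicits._

// Read as a plain DataFrame, then build each Person by hand. The lastname
// lookup is defensive because the inferred schema may not have that column.
val people = spark.read.json("example.json").map { row: Row =>
  val lastname =
    if (row.schema.fieldNames.contains("lastname")) Option(row.getAs[String]("lastname"))
    else None
  Person(row.getAs[String]("name"), lastname, BigInt(row.getAs[Long]("age")))
}
people.show()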

2
Hi, if you add another JSON record (record 2) with a lastname in it and it works, then I think Spark infers the schema to three columns. With your example of only one record, how would Spark (or anyone) know that you intend to have three columns? – C.S.Reddy Gadipally
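
To illustrate the comment's point, a hypothetical example2.json with one record per line, where at least one record carries lastname, would let inference discover all three columns:

// example2.json (hypothetical contents, one JSON object per line):
//   { "name": "bemjamin", "age": 1 }
//   { "name": "ann", "lastname": "smith", "age": 2 }
spark.read.json("example2.json").as[Person].show()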

2 Answers

7
votes

We have one more option to solve the above issue. Two steps are required:

  1. Make sure that fields that can be missing are declared as nullable Scala types (like Option[_]).

  2. Provide a schema argument rather than depending on schema inference. You can, for example, use a Spark SQL Encoder:

    import org.apache.spark.sql.Encoders
    
    val schema = Encoders.product[Person].schema
    

You can update the code as below:

val schema = Encoders.product[Person].schema

val df = spark.read
           .schema(schema)
           .json("/Users/../Desktop/example.json")
           .as[Person]

df.show()

+--------+--------+---+
|    name|lastname|age|
+--------+--------+---+
|bemjamin|    null|  1|
+--------+--------+---+
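
For reference, you can inspect the schema the encoder derives; the tree below is roughly what it looks like (the exact decimal representation of BigInt is an assumption):

schema.printTreeString()
// root
//  |-- name: string (nullable = true)
//  |-- lastname: string (nullable = true)
//  |-- age: decimal(38,0) (nullable = true)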
1
vote

When you perform spark.read.json("example.json").as[Person].show(), it basically reads the dataframe as:

FileScan json [age#6L,name#7]

and then tries to apply the Encoder for the Person object, hence the AnalysisException, as it is not able to find lastname in your JSON file.

You could either hint to Spark that lastname is optional by supplying some data that has a lastname, or try this:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema: StructType = ScalaReflection.schemaFor[Person].dataType.asInstanceOf[StructType]
val x = spark.read
      .schema(schema)
      .json("src/main/resources/json/x.json")
      .as[Person]
x.show()
+--------+--------+---+
|    name|lastname|age|
+--------+--------+---+
|bemjamin|    null|  1|
+--------+--------+---+

Hope it helps.