1 vote

I'm trying to read JSON into a Dataset (Spark 2.1.1). Unfortunately it doesn't work and fails with:

Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "age")

Any ideas what I am doing wrong?

import org.apache.spark.sql.SparkSession

case class Owner(id: String, pets: Seq[Pet])
case class Pet(name: String, age: Long)

val sampleJson = """{"id":"kotek", "pets":[{"name":"miauczek", "age":18}, {"name":"miauczek2", "age":9}]}"""

val session = SparkSession.builder().master("local").getOrCreate()
import session.implicits._

val rdd = session.sparkContext.parallelize(Seq(sampleJson))
val ds = session.read.json(rdd).as[Owner].collect()
I believe this is a bug in Spark. If I understand correctly what is happening here, Spark is NOT mapping by name for that inner type ("pets") but is instead mapping those attributes by sorted order. So pets.age gets mapped to Pet.name, and while trying to map pets.name -> Pet.age it fails with the exception. Can anyone confirm that my understanding is correct and that this is a Spark bug? - Pawel Niezgoda
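One way to check this hypothesis is to print the inferred schema. Under Spark's alphabetical JSON schema inference, the pets struct should list age before name, the reverse of Pet's declaration order (a minimal diagnostic, assuming the same rdd and session as in the question):

// Inspect the schema Spark inferred from the JSON string
session.read.json(rdd).printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- pets: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- age: long (nullable = true)
//  |    |    |-- name: string (nullable = true)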

1 Answer

3 votes

Usually, if some field can be missing, use either an Option:

case class Owner(id: String, pets: Seq[Pet])
case class Pet(name: String, age: Option[Long])

or a nullable type:

case class Owner(id: String, pets: Seq[Pet])
case class Pet(name: String, age: java.lang.Long)
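Either variant makes the null survivable instead of throwing. A minimal sketch of consuming the Option version (assuming the same session and rdd as in the question, and a Spark version where the read succeeds):

val owners = session.read.json(rdd).as[Owner].collect()
for {
  owner <- owners
  pet   <- owner.pets
} {
  // age is Option[Long]; substitute a sentinel when it is missing
  println(s"${owner.id}'s pet ${pet.name} is ${pet.age.getOrElse(-1L)} years old")
}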

But this one indeed looks like a bug. I tested this in Spark 2.2, and it has been resolved by now. A quick workaround is to keep the fields sorted by name:

case class Owner(id: String, pets: Seq[Pet])
case class Pet(age: java.lang.Long, name: String)
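For completeness, here is the workaround as a self-contained sketch (the object name Workaround is just for illustration; the alphabetical field order in Pet is the only change from the question's code):

import org.apache.spark.sql.SparkSession

object Workaround {
  // Fields declared in alphabetical order, matching the alphabetically
  // sorted schema that Spark infers from JSON
  case class Owner(id: String, pets: Seq[Pet])
  case class Pet(age: java.lang.Long, name: String)

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder().master("local").getOrCreate()
    import session.implicits._

    val sampleJson = """{"id":"kotek", "pets":[{"name":"miauczek", "age":18}, {"name":"miauczek2", "age":9}]}"""
    val rdd = session.sparkContext.parallelize(Seq(sampleJson))

    // With the field order aligned, deserialization no longer NPEs
    val owners = session.read.json(rdd).as[Owner].collect()
    owners.foreach(println)
  }
}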