I'm trying to set up Spark in a new project, and I have some case classes generated from schemas elsewhere in my company that I want to use as a template to read/write in a variety of formats (Parquet and JSON).
I'm noticing an issue in JSON with one of our fields, which is an Option[String]. The corresponding data is usually null, but sometimes isn't. When I'm testing with subsets of this data, there's a decent chance that every row has null in this column. Spark seems to detect that and omit the field from the JSON output for any row where the value is null.
When reading, as long as at least one row has the data, Spark picks up the column during schema inference and can translate it back to the case class just fine. But if no row has it, Spark sees a missing column and fails.
Here's some code that demonstrates this.
import org.apache.spark.sql.SparkSession

object TestNulls {
  case class Test(str: Option[String])

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .master("local[*]") // local master so the example can run standalone
      .getOrCreate()
    import spark.implicits._

    val dataset = Seq(
      Test(None),
      Test(None),
      Test(None)
    ).toDS()

    // Because every row is null, this writes {} for each row
    dataset.write.json("testpath")

    // Fails because column `str` does not exist, even though it is an Option
    spark.read.json("testpath").as[Test].show()
  }
}
Is there a way to tell Spark not to fail on a missing nullable column? Failing that, is there a human-readable format I can use that won't exhibit this behavior? The JSON is mainly so that we can write human-readable files for testing and local development.
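For reference, the only workaround I've come up with so far is a sketch like the one below: derive the schema from the case class with Encoders.product and pass it explicitly on read, so Spark never has to infer the missing column. It would drop into the main method above (it assumes the same spark, Test, and spark.implicits._ as the earlier snippet), but I'd rather not thread a schema through every read call if Spark can handle this directly.

import org.apache.spark.sql.Encoders

// Build the schema from the case class instead of inferring it from the data,
// so a file where every value of `str` was null still reads back as Test(None).
val schema = Encoders.product[Test].schema
spark.read.schema(schema).json("testpath").as[Test].show()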