I'm trying to set up Spark in a new project, and I have some case classes generated from schemas elsewhere in my company that I want to use as a template to read/write in a variety of formats (Parquet and JSON).
I'm noticing an issue in JSON with one of our fields, which is an Option[String]. The corresponding data is usually null, but sometimes isn't. When I'm testing with subsets of this data, there's a decent chance that every row has null in this column. Spark seems to detect that and omit the field from the JSON output for any row where the value is null.
When reading, as long as at least one row has the data, Spark picks up the column during schema inference and can translate it back to the case class just fine. But if no row has it, Spark sees a missing column and fails.
Here's some code that demonstrates this.
import org.apache.spark.sql.SparkSession

object TestNulls {
  case class Test(str: Option[String])

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .master("local[*]") // local master so the example can run standalone
      .getOrCreate()
    import spark.implicits._

    val dataset = Seq(
      Test(None),
      Test(None),
      Test(None)
    ).toDS()

    // Because every row is null, this writes {} for each row
    dataset.write.json("testpath")

    // Fails because column `str` does not exist, even though it is an Option
    spark.read.json("testpath").as[Test].show()
  }
}
Is there a way to tell Spark not to fail on a missing nullable column? Failing that, is there a human-readable format I can use that won't exhibit this behavior? The JSON is mainly so that we can write human-readable files for testing and local development.
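For reference, the only workaround I've come up with so far is a sketch like the one below: derive the schema from the case class with Encoders.product and pass it explicitly on read, so Spark never has to infer the missing column. It would drop into the main method above (it assumes the same spark, Test, and spark.implicits._ as the earlier snippet), but I'd rather not thread a schema through every read call if Spark can handle this directly.

import org.apache.spark.sql.Encoders

// Build the schema from the case class instead of inferring it from the data,
// so a file where every value of `str` was null still reads back as Test(None).
val schema = Encoders.product[Test].schema
spark.read.schema(schema).json("testpath").as[Test].show()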