
Why does Spark show nullable = true when the schema is not specified and its inference is left to Spark?

// shows nullable = true even for fields that are present in every JSON record
spark.read.json("s3://s3path").printSchema()

Going through the class JsonInferSchema, I can see that for StructType fields, nullable is explicitly set to true. But I am unable to understand the reason behind it.
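
For reference, inference can be bypassed entirely by supplying an explicit schema. A minimal sketch, assuming a SparkSession named spark and the same S3 path (the field names are hypothetical); note that, depending on the Spark version, file-based sources may still report nullable = true on read even when the supplied schema declares a field non-nullable, since Spark cannot trust the files to honour it:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical field names, for illustration only.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// Skips inference entirely; records that do not match are filled with nulls.
spark.read.schema(schema).json("s3://s3path").printSchema()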

PS: My aim is to infer the schema for a large JSON data set (< 100GB), and I wanted to see whether Spark provides this ability or whether I would have to write a custom map-reduce job as highlighted in the paper Schema Inference for Massive JSON Datasets. One major part is that I want to know which fields are optional and which are mandatory (with respect to the data set); see the sketch after this paragraph.
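
A minimal sketch of that last part, assuming a SparkSession named spark: after letting Spark infer the schema, count the nulls per top-level column; a field is mandatory (with respect to the data set) exactly when its null count is zero.

import org.apache.spark.sql.functions.{col, count, when}

val df = spark.read.json("s3://s3path")

// One null count per top-level column; nested structs would need flattening first.
// A field absent from a record is read as null, so isNull catches both
// explicit nulls and missing fields.
df.select(df.columns.map { c =>
  count(when(col(c).isNull, 1)).alias(c) // counts rows where column c is null
}: _*).show()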


1 Answer


Because Spark may infer the schema from only a sample of the data, it cannot be 100% sure that a field is never null: a field that is always present in the sample could still be missing or null in the rest of the dataset. Given that limited checking scope and sample size, it is safer to mark every field as nullable = true. That simple.
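
A brief illustration of the sampling point: the JSON reader's samplingRatio option (default 1.0) controls what fraction of the input is scanned during schema inference, so with any ratio below 1.0 Spark provably cannot rule nullability out. Assuming a SparkSession named spark and the hypothetical path from the question:

val sampled = spark.read
  .option("samplingRatio", 0.1) // infer the schema from roughly 10% of the records
  .json("s3://s3path")

sampled.printSchema() // every field is still reported with nullable = true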