Why does Spark show nullable = true when a schema is not specified and its inference is left to Spark?
// shows nullable = true even for fields that are present in every JSON record
spark.read.json("s3://s3path").printSchema()
Going through the JsonInferSchema class, I can see that nullable is explicitly set to true for StructType fields, but I am unable to understand the reason behind it.
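For reference, here is a minimal sketch of how I am inspecting this (reusing the placeholder s3://s3path path from above), walking the inferred StructType directly instead of calling printSchema():

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("inspect-inferred-schema").getOrCreate()

// Let Spark infer the schema, then inspect each field programmatically.
val schema = spark.read.json("s3://s3path").schema
schema.fields.foreach { f =>
  // Every field comes back with nullable = true, even those present in all records.
  println(s"${f.name}: ${f.dataType.simpleString}, nullable = ${f.nullable}")
}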
PS: My aim is to infer the schema for a large JSON data set (< 100GB), and I wanted to see whether Spark provides this ability or whether I would have to write a custom map-reduce job, as highlighted in the paper Schema Inference for Massive JSON Datasets. One major requirement is knowing which fields are optional and which are mandatory (with respect to the data set).
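For context, this is roughly what I am trying to compute. A minimal sketch, assuming only top-level fields (nested structs would need flattening first) and column names without special characters; it relies on two behaviours I believe hold: spark.read.json fills fields missing from a record with null, and count(col) skips nulls, so a field is mandatory exactly when its non-null count equals the total record count:

import org.apache.spark.sql.functions.{col, count}

val df = spark.read.json("s3://s3path")   // placeholder path
val total = df.count()

// One count per column; count(col) ignores null values.
val nonNullCounts = df
  .select(df.columns.map(c => count(col(c)).alias(c)): _*)
  .head()

df.columns.foreach { c =>
  val nonNull = nonNullCounts.getAs[Long](c)
  val kind = if (nonNull == total) "mandatory" else "optional"
  println(s"$c: $kind ($nonNull of $total records non-null)")
}

This scans the data twice (once for the total, once for the per-column counts), which is why I am wondering whether Spark's inference machinery already tracks this, or whether a custom job along the lines of the paper is unavoidable.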