From the Spark documentation:
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or setting the global SQL option spark.sql.parquet.mergeSchema to true.
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html)
My understanding from the documentation is that if I have multiple Parquet partitions with different schemas, Spark will be able to merge these schemas automatically if I use spark.read.option("mergeSchema", "true").parquet(path).
This seems like a good option if I don't know at query time what schemas exist in these partitions.
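For concreteness, this is the kind of merged-schema read I have in mind (the path is just a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path: a directory whose partitions contain Parquet files
# that may have been written with different (but compatible) schemas.
path = "/tmp/events"

# Ask Spark to reconcile the per-file schemas into one merged schema.
merged_df = spark.read.option("mergeSchema", "true").parquet(path)
merged_df.printSchema()
```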
However, consider the case where I have two partitions, one using an old schema, and one using a new schema that differs only in having one additional field. Let's also assume that my code knows the new schema and I'm able to pass this schema in explicitly.
In that case, I would do something like spark.read.schema(my_new_schema).parquet(path). What I'm hoping Spark would do is read both partitions using the new schema and simply supply null values for the new column for any rows in the old partition. Is this the expected behavior, or do I also need to use option("mergeSchema", "true") in this case?
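Here is a minimal sketch of the setup I'm describing (the base path, column names, and my_new_schema are made up purely for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical base path, just to illustrate the question.
base_path = "/tmp/events"

# Old partition: written with the old schema (no "extra" column).
old_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
old_df.write.parquet(base_path + "/day=old")

# New partition: written with the new schema, which adds one field ("extra").
new_df = spark.createDataFrame([(3, "c", 42)], ["id", "name", "extra"])
new_df.write.parquet(base_path + "/day=new")

# The new schema, known to my code ahead of time
# (including the partition column "day").
my_new_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("extra", LongType(), True),
    StructField("day", StringType(), True),
])

# Read both partitions with the explicit schema and no mergeSchema option.
# The behavior I'm hoping for: rows from day=old come back with null in "extra".
df = spark.read.schema(my_new_schema).parquet(base_path)
df.show()
```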
I'd like to avoid the mergeSchema option if possible, to skip the additional overhead mentioned in the documentation.