
I have a partitioned Parquet location on HDFS where different partitions have different schemas.

Say the first partition has 5 columns and the second partition has 4. Now I read the base Parquet path and then filter for the 2nd partition.

This gives me 5 columns in the DataFrame even though the Parquet files in the 2nd partition contain only 4 columns. When I read the 2nd partition path directly, I get the correct 4 columns. How can I fix this?
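For illustration, a minimal sketch of the two reads described above (the base path, the partition column name part, and its value are placeholders, not my actual data):

import org.apache.spark.sql.functions.col

// reading the base path and filtering on the partition column:
// the DataFrame still shows the 5-column schema
val filtered = spark.read.parquet("/data/base_path").filter(col("part") === 2)
filtered.printSchema()

// reading the 2nd partition's directory directly:
// the DataFrame shows only the 4 columns present in its files
val direct = spark.read.parquet("/data/base_path/part=2")
direct.printSchema()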


2 Answers

4 votes

You can specify the required schema (4 columns) while reading the Parquet files!

  • Spark then reads only the fields included in the schema; if a field does not exist in the data, null will be returned for it.

Example:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// schema with column "i" (present in the data) and column "z" (absent from the data)
val sch = new StructType().add("i", IntegerType).add("z", StringType)
spark.read.schema(sch).parquet("<parquet_file_path>").show()

// the data has field i but does not have field z, so z is returned as null
//+---+----+
//|  i|   z|
//+---+----+
//|  1|null|
//+---+----+
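Applied to your case, a sketch that reads the base path with only the columns the 2nd partition actually has; the column names, types, path, and partition column part below are placeholders:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// only the 4 columns present in the 2nd partition, plus the partition column
val partition2Schema = new StructType()
  .add("col1", IntegerType)
  .add("col2", StringType)
  .add("col3", StringType)
  .add("col4", DoubleType)
  .add("part", IntegerType)          // partition column, kept so it can be filtered on

spark.read
  .schema(partition2Schema)
  .parquet("<base_parquet_path>")    // base path, not the partition sub-directory
  .filter(col("part") === 2)
  .drop("part")                      // drop the partition column to keep the 4 data columns
  .show()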
0 votes

I would really like to help you, but I am not sure what you actually want to achieve. What is your intention here?

If you want to read the Parquet data with all of its partitions and you just want the merged set of columns from both partitions, the read option "mergeSchema" may fit your need.

Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or setting the global SQL option spark.sql.parquet.mergeSchema to true.
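For example, a sketch of both ways to enable it (the path is a placeholder):

// per-read data source option
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("<base_parquet_path>")

// or globally for the session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val mergedGlobal = spark.read.parquet("<base_parquet_path>")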

Refer to the Spark documentation on Parquet schema merging.

So it would be interesting to know which version of Spark you are using and how the properties spark.sql.parquet.mergeSchema (Spark SQL setting) and mergeSchema (per-read option) are set.
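You can check both directly in the shell (spark is the active SparkSession; the second lookup falls back to "false", the default since 1.5.0):

println(spark.version)
println(spark.conf.get("spark.sql.parquet.mergeSchema", "false"))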