
I would like to cross-check my understanding of how the Apache Avro and Apache Parquet file formats differ in terms of schema evolution. From various blogs and SO answers I have arrived at the understanding below. I would like to verify that it is correct and to know whether I am missing any other differences with respect to schema evolution. The explanation assumes these file formats are used from Apache Hive.

  1. Adding a column: Adding a column (with a default value) at the end of the column list is supported in both file formats. I think adding a column (with a default value) in the middle of the column list can also be supported in Parquet if the Hive table property "hive.parquet.use-column-names=true" is set. Is this not the case?

  2. Deleting a column: Deleting a column at the end of the column list is, I think, supported in both file formats. If a Parquet/Avro file still contains the deleted column, the writer's schema (the actual Avro or Parquet file schema) has an additional column that the reader's schema (the Hive schema) lacks, and I think that extra column is simply ignored in both formats. Deleting a column in the middle of the column list can also be supported if the property "hive.parquet.use-column-names=true" is set. Is my understanding correct?

  3. Renaming a column: Since Avro has a "column alias" option, column renaming is supported in Avro, but it is not possible in Parquet because there is no such column aliasing option in Parquet. Am I right?

  4. Changing a data type: This is supported in Avro because we can define multiple data types for a single column using a union type, but it is not possible in Parquet because there is no union type in Parquet.

Am I missing any other possibility? I appreciate the help.


1 Answer


hive.parquet.use-column-names=true needs to be set to access Parquet columns by name. It is not specific to column addition or deletion. Manipulating columns by index would be cumbersome to the point of being infeasible.
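
To make the difference concrete, here is a minimal sketch in plain Python (no Hive or Parquet involved) of what goes wrong with positional access when a column is added in the middle of the table schema. The helpers read_by_index and read_by_name and the sample columns are hypothetical and only simulate the two lookup strategies; they are not how any engine is actually implemented.

```python
def read_by_index(file_row, table_columns):
    """Positional lookup: the i-th table column reads the i-th file value."""
    return {
        name: (file_row[i] if i < len(file_row) else None)
        for i, name in enumerate(table_columns)
    }


def read_by_name(file_columns, file_row, table_columns):
    """Name lookup: each table column reads the file column with the same name."""
    lookup = dict(zip(file_columns, file_row))
    # Columns missing from the file come back as None (NULL).
    return {name: lookup.get(name) for name in table_columns}


# An old file written before the schema change (hypothetical data).
file_columns = ["id", "name", "city"]
file_row = [1, "Ann", "Oslo"]

# The table schema after adding "age" in the middle.
table_columns = ["id", "age", "name", "city"]

by_index = read_by_index(file_row, table_columns)
by_name = read_by_name(file_columns, file_row, table_columns)

print(by_index)  # values are shifted: "age" picks up the old "name" value
print(by_name)   # "age" is None, the other columns line up
```

With positional lookup every column after the insertion point is shifted, which is why name-based access is what makes adding or dropping a column in the middle workable.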

There is a workaround for column renaming as well. Refer to https://stackoverflow.com/a/57176892/14084789

Union is a challenge with Parquet.
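
For the Avro side of the comparison, the behaviour the question relies on comes from Avro's schema resolution between the writer's and the reader's schema: reader fields are matched by name or alias, reader fields with no match take their default, writer-only fields are ignored, and a few primitive promotions (for example int to long) are allowed without a union. The following is a rough, simplified simulation of those rules in plain Python. resolve_record and the sample schemas are hypothetical, it handles only flat records of primitive types, and it is not the Avro library's API.

```python
# Primitive promotions allowed from writer type to reader type,
# as listed in the Avro specification's schema resolution rules.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def resolve_record(writer_fields, reader_fields, record):
    """Project a record written with writer_fields onto reader_fields."""
    writer_types = {f["name"]: f["type"] for f in writer_fields}
    result = {}
    for field in reader_fields:
        # A reader field matches a writer field by its name or by an alias.
        candidates = [field["name"]] + field.get("aliases", [])
        match = next((n for n in candidates if n in writer_types), None)
        if match is None:
            # Field unknown to the writer: the reader's default is required.
            if "default" not in field:
                raise ValueError("no default for " + field["name"])
            result[field["name"]] = field["default"]
            continue
        w_type, r_type = writer_types[match], field["type"]
        if w_type != r_type and r_type not in PROMOTIONS.get(w_type, set()):
            raise TypeError("cannot read %s as %s" % (w_type, r_type))
        # The value is passed through as-is; a real decoder would convert it.
        result[field["name"]] = record[match]
    # Writer fields that the reader never asked for are simply dropped.
    return result


writer_fields = [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "city", "type": "string"},
]
reader_fields = [
    {"name": "id", "type": "long"},                             # type widened
    {"name": "full_name", "type": "string", "aliases": ["name"]},  # renamed
    {"name": "country", "type": "string", "default": "unknown"},   # added
]

resolved = resolve_record(
    writer_fields, reader_fields, {"id": 1, "name": "Ann", "city": "Oslo"}
)
print(resolved)
```

Here the renamed column is found through its alias, the new column falls back to its default, and the dropped column is ignored. Whether Hive's Avro and Parquet readers honour each of these rules should be checked against the versions in use.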