2
votes

I have a problem with ALTER TABLE: it changes the table schema but not the Parquet schema.

For example, I have a Parquet table with these columns:

column1 (string)
column2 (string)
column3 (string)
column4 (string)
column5 (bigint)

Now, I try to change the table's schema with

ALTER TABLE name_table DROP COLUMN column3; 

With DESCRIBE I can see that column3 is no longer there.

Now I try to execute select * from the table, but I receive an error like this:

"data.0.parq' has an incompatible type with the table schema for column column4. Expected type: INT64. Actual type: BYTE_ARRAY"

The values of the deleted column are still present in the Parquet file, which has 5 columns instead of 4 (as in the table schema).

Is this a bug? How can I change the Parquet file's schema using Hive?

3
Are you sure you're using Hive and not Impala? - Roberto Congiu

3 Answers

2
votes

This is not a bug. When you drop a column, that only updates the definition in the Hive Metastore, which holds the information about the table. The underlying files on HDFS remain unchanged. Since the Parquet metadata is embedded in the files, they have no idea that the table metadata has changed. Hence you see this issue.
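If you need the files to actually match the new schema, one workaround is to rewrite the data while the table definition still matches the files (i.e., before the DROP). A minimal sketch in HiveQL; name_table_new is a hypothetical name:

-- Rewrite the Parquet files, keeping only the columns you want
CREATE TABLE name_table_new STORED AS PARQUET AS
SELECT column1, column2, column4, column5 FROM name_table;

-- Then swap the tables
DROP TABLE name_table;
ALTER TABLE name_table_new RENAME TO name_table;

This pays the cost of rewriting the data once, but afterwards the file schema and the table schema agree.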

2
votes

The solution is described here. If you want to add columns to a Parquet table and stay compatible with both Impala and Hive, you need to add them at the end.

If you alter the table and change column names or drop a column, that table will no longer be compatible with Impala.
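For example, appending a column at the end with ADD COLUMNS is safe in both engines, since Hive always appends new columns after the existing ones; column6 here is a hypothetical name:

ALTER TABLE name_table ADD COLUMNS (column6 string);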

0
votes

I had the same error after adding a column to a Hive table.

The solution is to set the query option below in each session:

set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;

If you're using the Cloudera distribution, you can set it permanently in Cloudera Manager => Impala Configuration => Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve):

Set the config value to PARQUET_FALLBACK_SCHEMA_RESOLUTION=name.
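With name-based resolution, Impala matches Parquet file columns to table columns by name instead of by position, so an extra or shifted column in the file no longer breaks the query. A minimal per-session sketch in impala-shell, using the table from the question:

set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
select * from name_table;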