2
votes

I have a problem with ALTER TABLE: it changes the table schema but not the Parquet schema.

For example, I have a Parquet table with these columns:

column1 (string)
column2 (string)
column3 (string)
column4 (string)
column5 (bigint)

Now, I try to change the table's schema with

ALTER TABLE name_table DROP COLUMN column3; 

With DESCRIBE I can see that column3 is no longer there.

Now I try to execute select * from the table, but I receive an error like this:

"data.0.parq' has an incompatible type with the table schema for column column4. Expected type: INT64. Actual type: BYTE_ARRAY"

The values of the deleted column are still present in the Parquet file, which has 5 columns instead of 4 (as in the table schema).

Is this a bug? How can I change the Parquet file's schema using Hive?

3
Are you sure you're using Hive and not Impala? - Roberto Congiu

3 Answers

2
votes

This is not a bug. When you drop a column, that only updates the definition in the Hive Metastore, which holds the information about the table. The underlying files on HDFS remain unchanged. Since the Parquet metadata is embedded in the files, they have no idea that the table metadata has changed. Hence you see this issue.
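If you need the files to actually match the new schema, one workaround is to rewrite the data while the table definition still matches the files (i.e., before the DROP). A minimal sketch in HiveQL; name_table_new is a hypothetical name:

-- Rewrite the Parquet files, keeping only the columns you want
CREATE TABLE name_table_new STORED AS PARQUET AS
SELECT column1, column2, column4, column5 FROM name_table;

-- Then swap the tables
DROP TABLE name_table;
ALTER TABLE name_table_new RENAME TO name_table;

This pays the cost of rewriting the data once, but afterwards the file schema and the table schema agree.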

2
votes

The solution is described here. If you want to add columns to a Parquet table and stay compatible with both Impala and Hive, you need to add them at the end.

If you alter the table and change column names or drop a column, that table will no longer be compatible with Impala.
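For example, appending a column at the end with ADD COLUMNS is safe in both engines, since Hive always appends new columns after the existing ones; column6 here is a hypothetical name:

ALTER TABLE name_table ADD COLUMNS (column6 string);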

0
votes

I had the same error after adding a column to a Hive table.

The solution is to set the query option below in each session:

set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;

If you're using the Cloudera distribution, you can set it permanently in Cloudera Manager => Impala Configuration => Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve):

Set the config value to PARQUET_FALLBACK_SCHEMA_RESOLUTION=name.
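With name-based resolution, Impala matches Parquet file columns to table columns by name instead of by position, so an extra or shifted column in the file no longer breaks the query. A minimal per-session sketch in impala-shell, using the table from the question:

set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
select * from name_table;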