I have a few doubts about Parquet compression between Impala, Hive and Spark. Here is the situation (rough sketches of the statements I ran at each step are shown after the list):
1. The table is a Hive-warehouse table that was created and populated from Impala, stored as Parquet; the data files have names like `data.0.parq`. Size:

   59.0 M  177.1 M  /user/hive/warehouse/database.db/tablename  (Parquet, created in Impala)
2. The same table was created in Hive as `tablename_snappy` with Snappy compression set via TBLPROPERTIES ("parquet.compression"="SNAPPY"), and the data was inserted in Hive from the table in step 1.
   2a) Why is this table larger?
   2b) The file name is `000000_0` (is this expected?)

   64.6 M  193.7 M  /user/hive/warehouse/database.db/tablename_parq  (Parquet + Snappy compression, created in Hive)
3. In Spark I read the table from step 1 and did saveAsTable; the file size is reduced as expected and the files are named `*.snappy.parquet`.

   39.0 M  117.1 M  /user/hive/warehouse/database.db/tablename_spark  (Parquet + Snappy compression, created in Spark)
4. The same table was created in Impala with STORED AS PARQUET after SET COMPRESSION_CODEC=snappy. No change; I expected the table size to shrink since I applied Snappy compression.

   59.0 M  177.1 M  /user/hive/warehouse/database.db/tablename  (Parquet, created in Impala)
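For reference, step 1 was roughly the following in impala-shell (the column list is a simplified placeholder, not my real schema, and `source_table` stands in for the actual source of the data):

```sql
-- Step 1: Impala creates the table as Parquet and inserts the data
-- (no COMPRESSION_CODEC set explicitly in this session)
CREATE TABLE database.tablename (
  id   INT,
  col1 STRING
)
STORED AS PARQUET;

INSERT INTO database.tablename
SELECT id, col1 FROM database.source_table;
```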
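Step 2 in Hive (beeline) was roughly this, with the same placeholder columns:

```sql
-- Step 2: Hive table with Snappy requested through TBLPROPERTIES,
-- populated from the Impala-written table of step 1
CREATE TABLE database.tablename_snappy (
  id   INT,
  col1 STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT INTO TABLE database.tablename_snappy
SELECT * FROM database.tablename;
```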
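Step 3 was done with the DataFrame API (read the step 1 table, then saveAsTable). In Spark SQL terms it is roughly equivalent to the CTAS below; Spark's Parquet writer uses Snappy by default (spark.sql.parquet.compression.codec), which matches the `*.snappy.parquet` file names I see:

```sql
-- Step 3: rough Spark SQL equivalent of
--   spark.table("database.tablename").write.saveAsTable("database.tablename_spark")
CREATE TABLE database.tablename_spark
USING parquet
AS SELECT * FROM database.tablename;
```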
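Step 4, back in impala-shell, was roughly this (the target table and column names are again placeholders):

```sql
-- Step 4: query option set in the same Impala session before the insert
SET COMPRESSION_CODEC=snappy;

CREATE TABLE database.tablename_impala_snappy (
  id   INT,
  col1 STRING
)
STORED AS PARQUET;

INSERT INTO database.tablename_impala_snappy
SELECT * FROM database.tablename;
```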
Please help me understand how Parquet compression works in Impala and Hive.