
I have a few doubts about Parquet compression between Impala, Hive, and Spark. Here is the situation (a rough sketch of the corresponding statements follows the list):

  1. The table is in Hive and the data was inserted using Impala. The table size is shown below and the data files have the extension "data.0.parq".
     59.0 M  177.1 M  /user/hive/warehouse/database.db/tablename  (Parquet, created in Impala)
  2. The same table was created in Hive as tablename_snappy, with Snappy compression set via TBLPROPERTIES ("parquet.compression"="SNAPPY"), and the data was inserted in Hive from the table in step 1.
     2a) Why is the table size larger?
     2b) The file name is 000000_0 (is this expected?)
     64.6 M  193.7 M  /user/hive/warehouse/database.db/tablename_parq  (Parquet, Snappy compression, created in Hive)
  3. In Spark I read the table from step 1 and called saveAsTable. The file size is reduced as expected and the file names are ****.snappy.parquet.
     39.0 M  117.1 M  /user/hive/warehouse/database.db/tablename_spark  (Parquet, Snappy compression, created in Spark)
  4. The same table was created in Impala with STORED AS PARQUET and SET COMPRESSION_CODEC=snappy;. There was no change; I expected the table size to shrink since I applied Snappy compression.
     59.0 M  177.1 M  /user/hive/warehouse/database.db/tablename  (Parquet, created in Impala)
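For reference, a rough sketch of the statements behind steps 2 and 4, assuming placeholder column names and table names (tablename_impala in particular is hypothetical, not my exact DDL); step 3 was done through Spark's DataFrame saveAsTable API rather than SQL:

    -- Step 2, run in Hive: declare Snappy compression on the Parquet table,
    -- then copy the data from the Impala-written table
    CREATE TABLE tablename_snappy (id INT, name STRING)
    STORED AS PARQUET
    TBLPROPERTIES ("parquet.compression"="SNAPPY");

    INSERT OVERWRITE TABLE tablename_snappy
    SELECT * FROM tablename;

    -- Step 4, run in impala-shell: COMPRESSION_CODEC is a per-session query option
    SET COMPRESSION_CODEC=snappy;
    CREATE TABLE tablename_impala STORED AS PARQUET
    AS SELECT * FROM tablename;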

Please help me understand how Parquet compression works in Impala and Hive.


1 Answer


The data size varies because of the default compression codec selected when each engine writes the Parquet files.

It is not specific to the application itself; it comes down to which codec was used.
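One way to check what each engine is configured with, as a minimal sketch (tablename_snappy is the Hive table from step 2 of the question):

    -- Hive: lists the table parameters, including parquet.compression if it was set
    DESCRIBE FORMATTED tablename_snappy;

    -- impala-shell: SET with no argument lists the current query options,
    -- including COMPRESSION_CODEC
    SET;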

Just try, before inserting data into the Hive table:

    SET COMPRESSION_CODEC=GZip;

and you will find the files compress better.
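A minimal sketch of that experiment; note that COMPRESSION_CODEC is an Impala query option, the analogous Hive session setting is parquet.compression (or the TBLPROPERTIES shown in the question), and tablename_gzip is a hypothetical target table:

    -- impala-shell: Parquet files written by this session will use GZIP
    SET COMPRESSION_CODEC=gzip;
    INSERT OVERWRITE TABLE tablename_gzip SELECT * FROM tablename;

    -- Hive session equivalent for a Parquet table
    SET parquet.compression=GZIP;
    INSERT OVERWRITE TABLE tablename_gzip SELECT * FROM tablename;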

Note: the default compression is "snappy".

Link for formats.