
I have a Python utility that uses the PyArrow library to write a single dataset as multiple Parquet files, because one day of data is too large for a single file. Each split Parquet file contains 10K row groups, and at the end the split files are combined into one large Parquet file. I then create two Impala tables: one over the merged file and one over the multiple split files.
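For context, a simplified sketch of what the writing side looks like (file names, data, and row counts here are illustrative, not the actual utility):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative stand-in for whatever produces one day's data in chunks.
batches = [
    pa.table({"id": list(range(i * 1000, (i + 1) * 1000))})
    for i in range(3)
]

for i, table in enumerate(batches):
    # A small row_group_size produces many tiny row groups per file,
    # which is the pattern described above.
    pq.write_table(table, f"split_{i}.parquet", row_group_size=100)
```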

When the split-file data is loaded into an Impala table and queried, results come back within seconds. But when the Impala table is created over the single merged Parquet file, queries are noticeably slower than against the split-file table. I cannot find any difference between the two tables, even after running COMPUTE STATS on both.

Any idea why there is this performance difference between the Impala table over the multiple split Parquet files and the one over the single merged file?


1 Answer


Historically, good Parquet performance has been associated with large Parquet files. However, in reality, good performance is not a result of large files but of large row groups instead (up to the HDFS block size).
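With PyArrow, the row group size is set via the `row_group_size` argument (a row count) at write time. A minimal sketch, with placeholder data and an arbitrary row count; in practice you would tune it so each encoded row group approaches the HDFS block size:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(5_000_000))})  # placeholder data

# One large row group per ~1M rows instead of many tiny groups.
pq.write_table(table, "large_row_groups.parquet", row_group_size=1_000_000)
```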

Putting row groups one after the other without merging them will not change Spark performance significantly, but it will make Impala a lot slower.
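You can confirm this by inspecting the Parquet footer with PyArrow: a file built by concatenating row groups keeps all of the original small row groups. A quick check, assuming the file names from the sketch above:

```python
import pyarrow.parquet as pq

for path in ("merged.parquet", "split_0.parquet"):
    md = pq.ParquetFile(path).metadata
    # A concatenation-style merge shows the same many small row groups
    # as the inputs, just in one file.
    print(path, "row groups:", md.num_row_groups, "rows:", md.num_rows)
```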

There are also several JIRA tickets on this topic.

Instead of merging the small Parquet files, you could put the fresh data in a separate table, possibly in a less efficient format (text file, Avro, or many small Parquet files), and then use Hive, Spark, or Impala to query that table's contents and bulk insert them into the production table. This will create properly sized Parquet files with an efficient row group size.
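For illustration, the bulk insert itself is a single statement; here is a sketch that drives it from Python with the impyla client (host, port, and table names are placeholders; the same statement can be run directly in impala-shell):

```python
from impala.dbapi import connect  # impyla client; assumed available

# Placeholder connection details for the Impala daemon.
conn = connect(host="impala-host", port=21050)
cur = conn.cursor()

# Rewrite the staging table's contents into the production Parquet table;
# Impala itself writes new Parquet files with large row groups.
cur.execute("""
    INSERT INTO prod_table
    SELECT * FROM staging_table
""")

cur.close()
conn.close()
```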