I have a Python utility that uses the PyArrow library to write a single data set as multiple Parquet files, because one day's worth of data is very large. Each split file contains 10K Parquet row groups. At the end, the split files are combined into one large Parquet file. I then created two Impala tables: one on the single merged file and one on the multiple split files.
When I load the split files into an Impala table and query it, results come back within seconds. Querying the Impala table built on the single merged Parquet file is noticeably slower by comparison. I ran COMPUTE STATS on both tables but could not identify any difference between them.
Any idea why query performance differs between the Impala table on the multiple split Parquet files and the Impala table on the single merged file?