2
votes

I would like to ask about Hive partitioning performance. How many records (in term of rows) would I need in order to see the partitioning performance?

Currently, I have 2 million+ records and I have split the table into 2 partitions. My partitioning condition is as follow:

  • ADD PARTITION (year_month=’2017_07’)
  • ADD PARTITION (year_month=’2017_08’)
  • INSERT OVERWRITE TABLE T PARTITION (year_month='2017_07') SELECT * FROM T WHERE st_time < '2017_08_01 00:00:00.0';
  • INSERT OVERWRITE TABLE T PARTITION (year_month='2017_08') SELECT * FROM T WHERE st_time >= '2017_08_01 00:00:00.0';

When I tried to do some performance testing between the partitioned table and non-partitioned table, the performance of both tables is relatively the same. My performance testing query is as follows:

  • SELECT * FROM T WHERE st_time < '2017_08_01 00:00:00.0';
  • SELECT * FROM non_part_table WHERE st_time < '2017_08_01 00:00:00.0';

Am I missing out on some important points?

Also, does partition work on derived column/function? E.g. date_format(st_time, yyyy-MM)

Any help or advice is greatly appreciated.

1

1 Answers

2
votes

the performance of both tables is relatively the same

You forgot the WHERE clause against the partition. Partitions only improve performance when you are selecting data within them.

SELECT * FROM T 
WHERE year_month = '2017_07'
-- AND st_time < '2017_08_01 00:00:00.0'
;

Without this, you're still scanning the whole table for the st_time values.

You can add an EXPLAIN to the queries to see the difference

You'll get additional performance improvements by converting data into Parquet or ORC