I would like to ask about Hive partitioning performance. How many records (in term of rows) would I need in order to see the partitioning performance?
Currently, I have 2 million+ records and I have split the table into 2 partitions. My partitioning condition is as follow:
ADD PARTITION (year_month=’2017_07’)
ADD PARTITION (year_month=’2017_08’)
INSERT OVERWRITE TABLE T PARTITION (year_month='2017_07') SELECT * FROM T WHERE st_time < '2017_08_01 00:00:00.0';
INSERT OVERWRITE TABLE T PARTITION (year_month='2017_08') SELECT * FROM T WHERE st_time >= '2017_08_01 00:00:00.0';
When I tried to do some performance testing between the partitioned table and non-partitioned table, the performance of both tables is relatively the same. My performance testing query is as follows:
SELECT * FROM T WHERE st_time < '2017_08_01 00:00:00.0';
SELECT * FROM non_part_table WHERE st_time < '2017_08_01 00:00:00.0';
Am I missing out on some important points?
Also, does partition work on derived column/function? E.g. date_format(st_time, yyyy-MM)
Any help or advice is greatly appreciated.