What exactly is the benefit of partitioning and bucketing a Hive table at the same time? I have a table "Orders" which contains 1M records, but the records come from only 6 specific cities. If I only bucket my table Orders by city, I get 6 different folders in my Hive warehouse directory, each corresponding to one city and holding its data.
When I partition and then bucket the Orders table, I still see the same 6 folders in my Hive warehouse directory. I tried using 16 buckets, but the data folders are still divided by city. Below is the code:
create table Orders ( id int, name string, address string)
partitioned by (city string)
clustered by (id) into 16 buckets
row format delimited fields terminated by ','
stored as TEXTFILE;
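One way to check where the buckets actually end up is to load the table and then list its directory recursively. This is only a minimal sketch under stated assumptions: `orders_staging` is a hypothetical unbucketed table holding the same data, the path assumes the default warehouse location and the default database, and `hive.enforce.bucketing` only matters on Hive versions before 2.0.

```sql
-- Assumption: orders_staging is a hypothetical unbucketed table with columns
-- (id int, name string, address string, city string) that already holds the data.

-- Allow partitions to be created from the values found in the city column.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Only needed on Hive versions before 2.0; later versions always enforce bucketing.
set hive.enforce.bucketing=true;

-- The partition column (city) must come last in the select list.
insert overwrite table Orders partition (city)
select id, name, address, city from orders_staging;

-- Assumption: default warehouse location and default database.
dfs -ls -R /user/hive/warehouse/orders;
```

In a listing like this, each `city=...` entry is a partition folder, and the buckets would show up as files inside each of those folders (up to 16 per partition here) rather than as additional folders.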
Can someone please explain why Hive behaves this way? I also ran some performance tests, such as count and group-by queries, and did not find any significant improvement in the partitioned and bucketed table compared with the only-bucketed or only-partitioned one.
Thank you.
I'm running Hadoop on 12 cores and 36 GB of RAM, with 8 clusters.