1
votes

What exactly is the benefit of partitioning and bucketing a Hive table at the same time? I have a table "Orders" which contains 1M records but, the records are from 6 specific cities. Now if I only bucket my table Orders based on cities, I get 6 different folders in my warehouse dir (in Hive), each of them corresponding to a particular city and data for it.

When I partition and then bucket my table Orders, still then I can see the same 6 folders in my warehouse dir under the hive. I tried using 16 buckets but still, the folders for data are divided as per the cities. Below is the code:

      create table Orders ( id int, name string, address string)
      partitioned by (city string)
      clustered by (id) into 16 buckets
      row format delimited fields terminated by ','
      stored as TEXTFILE

Can someone please outline why Hive is behaving this way. Also, I ran some performance metrics such as count and grouping. I did not find any significant improvement in the partitioned bucketed table vs only bucketed or only partitioned.

Thank you.

I'm running Hadoop on 12 cores, 36 Gb RAM with 8 Clusters.

1

1 Answers

5
votes

Partitioning and Bucketing are two different types of splitting your data at the physical layer.

As you saw, when you partition a table by a column, a directory will be created for each value of the column. As such, you'll typically want to partition on a column that has low cardinality. One of the most common partition columns you'll see is date.

With bucketing, the column value is hashed into a fixed number of buckets. This also physically splits your data. In your case, if you inspect the files in the city directories, you'll see 16 files, 1 for each bucket. Bucketing is typically used for high cardinality columns.

So, what is the advantage of partitioning and bucketing? Since the data is physically "partitioned", the query layer can apply two types of optimizations called partition pruning and bucket pruning. These optimizations will kick in when a WHERE clause is applied that can allow the optimizer to apply the pruning strategies. For example, in your case you have 6 directories (cities) times 16 files (id buckets), so you have a total of 96 files in your table. If you included a where clause for city = "city1", then only 16 files will be scanned since partition pruning will kick in. If you used a where clause for id = 10101, then only 6 files will be scanned since bucket pruning can be applied. If you apply both a city filter and an id filter, the only 1 file will need to be scanned.

EDIT: As pointed out in the comments, bucket pruning has only been implemented in the Tez engine. So, while in theory, buckets can be pruned, the optimization has not been implemented yet in Hive MR.