1
votes

I have a Hadoop cluster and I use Hive for querying, and I want to join two large tables one of which has small buckets, and from what I read it seems that if I bucket both tables on join-key, that would help the performance.

So my settings are:

  • Bucketed both tables on the join key into same number of buckets,
  • Buckets of the smaller table fits in memory,
  • set hive.optimize.bucketmapjoin = true;
  • Run the following query:
SELECT /*+ MAPJOIN(a) */
 count(*)
FROM a JOIN B ON  a.join_key = b.join_key;

Question 1: Is the above settings enough to trigger bucket map join?

Question 2: My understanding of bucket map join is that it starts a local task which creates hash table of each of the buckets and then uploades hashed buckets to each of the mappers. Is this understanding correct?

Question 3: If the above understanding correct, then why Hive does the hashing locally? Why it doesn't upload the raw bucket then does the hashing in map task, therefore achieve parallel processing and more speed?

1

1 Answers

-1
votes

Is the above settings enough to trigger bucket map join?

Answer: 1) If "the Total table/partition size is big, not good for map join", then go for bucket map join. settings required are : set hive.optimize.bucketmapjoin = true;

2) If "No limit on table/partition size very LARGE i.e the table size" then go for SortMerge bucket join. settings required are : set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

My understanding of bucket map join is that it starts a local task which creates hash table of each of the buckets and then uploades hashed buckets to each of the mappers. Is this understanding correct?

Answer: In case of Bucket map join, 1) hive runs a local map reduce join to create HashTable files, 2) it compress and archive the file and load to distributed cache (i- Large hash table file will slow down the propagation of Distributed Cache. ii- Mappers are waiting for the hash tables file from Distributed Cache. ) 3) loads to the mappers of the map Join task.

If the above understanding correct, then why Hive does the hashing locally? Why it doesn't upload the raw bucket then does the hashing in map task, therefore achieve parallel processing and more speed?

Answer: If the files are large then we will have below problem i- Large hash table file will slow down the propagation of Distributed Cache. ii- Mappers are waiting for the hash tables file from Distributed Cache.

So doing the below steps will give more performance:

1) hive runs a local map reduce join to create HashTable files, 2) it compress and archive the file and load to distributed cache
3) loads to the mappers of the map Join task.

This give better performance then the map join or normal joins.