I have a Hadoop cluster and I use Hive for querying, and I want to join two large tables one of which has small buckets, and from what I read it seems that if I bucket both tables on join-key, that would help the performance.
So my settings are:
- Bucketed both tables on the join key into same number of buckets,
- Buckets of the smaller table fits in memory,
- set hive.optimize.bucketmapjoin = true;
- Run the following query:
SELECT /*+ MAPJOIN(a) */
count(*)
FROM a JOIN B ON a.join_key = b.join_key;
Question 1: Is the above settings enough to trigger bucket map join?
Question 2: My understanding of bucket map join is that it starts a local task which creates hash table of each of the buckets and then uploades hashed buckets to each of the mappers. Is this understanding correct?
Question 3: If the above understanding correct, then why Hive does the hashing locally? Why it doesn't upload the raw bucket then does the hashing in map task, therefore achieve parallel processing and more speed?