I have three questions about the details of Hive's skew join optimization:
Question 1
The page https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization explains the basic idea behind Hive's skew join optimization, but some details are still unclear to me.
For example:
select A.id from A join B on A.id = B.id
Suppose table A has three skewed keys, id=1, id=2, and id=3, and the other keys are evenly distributed. Will Hive launch 4 MR jobs?
job 1 to handle the evenly distributed keys;
job 2 to handle skewed key id=1;
job 3 to handle skewed key id=2;
job 4 to handle skewed key id=3;
Is that right? Many thanks.
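To make the question concrete, here is a hand-written sketch of the split that the wiki page describes, extended to three skewed keys. It is only an illustration of the idea under my own assumptions (that all skewed keys could be grouped into one branch), not a statement of how Hive actually plans the jobs:

```sql
-- Branch 1: the evenly distributed keys, handled by an ordinary join.
SELECT A.id
FROM A JOIN B ON A.id = B.id
WHERE A.id NOT IN (1, 2, 3)

UNION ALL

-- Branch 2: the skewed keys only; the matching B rows are few enough
-- (by assumption) that this branch could be executed as a map join.
SELECT A.id
FROM A JOIN B ON A.id = B.id
WHERE A.id IN (1, 2, 3) AND B.id IN (1, 2, 3);
```

If Hive groups the skewed keys like this, there would be 2 jobs rather than 4, which is exactly what I would like to confirm.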
Question 2
As we know, the key point of the skew join optimization is that the skewed join keys, such as 1, 2, and 3, can be handled with a map join. If the conditions for a map join are not met, will Hive fall back to an ordinary join?
Question 3
The default setting is hive.skewjoin.key=100000, which is usually too small for practical queries. Is it possible to decide the trigger condition for skew join dynamically, for example based on the JVM heap size and the total number of rows in the skewed table?
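For reference, this is how the threshold is set by hand for a session today. It is a minimal sketch, and the value 500000 is purely illustrative rather than a recommendation:

```sql
-- Enable the runtime skew join optimization (off by default).
SET hive.optimize.skewjoin=true;

-- Number of rows per join key above which the key is treated as skewed.
-- 500000 is an arbitrary example value; the default is 100000.
SET hive.skewjoin.key=500000;

SELECT A.id FROM A JOIN B ON A.id = B.id;
```

What I am asking is whether Hive could choose this value automatically instead of relying on a fixed number set by the user.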