Indexing in Hive

Question

I have a fairly large Hive table (~20 Billion records) on a hadoop cluster, and I need to do several joins on it.

Is it possible to index this table on a key? For example, if the table name is table1 and I want to do multiple joins of table1 with table2, table3 and table 4 on column key what would be the most efficient way to do this?

If relevant tables 2-4 are relatively very small (~100 million each)

David Gruzman David Gruzman · Accepted Answer · 2012-07-23T12:19:06

IMHO Index will help in this case only when it has very high selectivity on table1. If serious part of the table1 (I would speculate more then 1-5%) is going to be result of the join, indexes are not going to be effective. The reason is that you will read all pages / blocks in any case.
Hive can decide that one table is small, and use it as in memory hashtable on each mapper thereof performing join without shuffling. https://issues.apache.org/jira/browse/HIVE-195
You also can get a improvement from the fact that your tables are partitioned in the case of mapjoin: https://issues.apache.org/jira/browse/HIVE-917
I am not sure if Hive is capable to do common join using partitioning information.

Indexing in Hive

1 Answers