Hive Index Performance

Question

I have two tables Table A and Table B which are 100GB and 35GB in size respectively. Also both the tables are compact indexed on the same column which is prodID.

I am facing an issue here where I am getting the same response time with or without index for a below query. It takes 30 minute to process the query.

select a.* from TableA a inner join TableB b on a.prodID=b.prodID.

I have 19 nodes cluster setup. Can you please advise me if I am missing any configuration here or doing something wrong.

Regards, Prabu

I don't think indices will help much with the join. You should consider defining your input tables clustered by prodID. See cwiki.apache.org/confluence/display/Hive/… — LiMuBei

Chetna C Chetna C · Accepted Answer · 2016-01-27T14:44:53

I think you should try putting large table i.e Table A at the last, or stream table A to improve performance. You can try following query to stream table.

select /*+STREAMTABLE(a)*/ a.* from TableA a inner join TableB b on a.prodID=b.prodID;

Please refer Tips using joins in hive for more information.

Hive Index Performance

1 Answers