My hive query is hanging and I don't know why (using hadoop 0.20.1, hive 0.9).
Query:
SELECT
a.field1 FROM table_1 a
LEFT SEMI JOIN
(SELECT DISTINCT(usrId) FROM table_2 b
WHERE soemthing=true ORDER BY rand() limit 1000) random_user_ids
WHERE a.usrId=random_user_ids.usrId
EXPLAIN gives me back:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2
Stage-0 is a root stage
Data-Set
- ~200M entries inside table
- the relation of table_1.usrId to table_1 should be about 1:40, i.e. 1000*40=40000 should be returned in above query
Observations:
- Job hangs at final Stage-3 at 33% (reduce > sort), where the joining is happening (join result is about 40000)
- Apart from being slow why is reduce > sort part of Stage-3 at all? It should only join stuff not ORDER anything
- reducer size is only 1 (because of sort?), which is nearly always bad as it doesn't scale.
If you need more input (e.g. more verbose EXPLAIN info, more cluster info) just tell.
Thanks!