Slow hive query, why (left-semi-join)?

Question

My hive query is hanging and I don't know why (using hadoop 0.20.1, hive 0.9).

Query:

SELECT 
   a.field1 FROM table_1 a 
LEFT SEMI JOIN 
   (SELECT DISTINCT(usrId) FROM table_2 b 
       WHERE soemthing=true ORDER BY rand() limit 1000) random_user_ids 
WHERE a.usrId=random_user_ids.usrId

EXPLAIN gives me back:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-2
  Stage-0 is a root stage

Data-Set

~200M entries inside table
the relation of table_1.usrId to table_1 should be about 1:40, i.e. 1000*40=40000 should be returned in above query

Observations:

Job hangs at final Stage-3 at 33% (reduce > sort), where the joining is happening (join result is about 40000)
Apart from being slow why is reduce > sort part of Stage-3 at all? It should only join stuff not ORDER anything
reducer size is only 1 (because of sort?), which is nearly always bad as it doesn't scale.

If you need more input (e.g. more verbose EXPLAIN info, more cluster info) just tell.

Thanks!

cabad cabad · Accepted Answer · 2013-07-11T14:58:40

The JOIN condition should be included in an ON clause, not in the WHERE clause.

Example of syntax:

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key)

Slow hive query, why (left-semi-join)?

1 Answers