1
votes

I want to subset all rows of TableA with key that appears in TableB.

TableA -

Idn   X  Y ...
 12   *  *
 13   *  *
 14   *  *

TableB -

Idn   A B C ...
 12   * * *
 12   * * *
 14   * * *

I want to select all elements in TableA with key appearing in TableB.

The following SQL query doesn't work in Hive -

select * from TableA where Idn in (select distinct Idn from TableB) a;

Is there any other way I can do it?

Can it be done in a single query?

Note that neither table need not be unique in the Idn.

1
Actually this eliminates the intermediate table - select a.* from tablea a inner join (select distinct idn from tableb) b on a.idn=b.idn. Is this the best way? - KalEl

1 Answers

2
votes

Something like a SEMI JOIN should do it:

SELECT * FROM TableA LEFT SEMI JOIN TableB on (TableA.Idn = TableB.Idn)

LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.