Join two tables on multiple columns without duplicate rows

Question

I have two tables:

Table A:

id

Table B:

id1
id2
... (lots of other columns)

I want to get all the rows in B where either B.id1 or B.id2 is A.id (all A.id are distinct. This isn't the case for B).

I've tried the query:

create table C as
select B.*
from B 
join A on (A.id = B.id1 or A.id = B.id2);

This works great except it duplicates rows where id1 and id2 are both in id. A is a relatively small table (around 5000 rows) whereas B is quite large (around 1 billion rows).

Solutions that I've thought of involve grouping by the columns of B so it eliminates distinct rows, collect the ids of A into an array and searching for id1 and id2 in the array, and using 2 separate queries which search for id1 and id2 respectively and then union the two tables. These all seem quite inefficient especially considering the size of the tables. Is there a better way of doing this that I'm missing? Thanks.

Gordon Linoff Gordon Linoff · Accepted Answer · 2016-07-21T00:57:12

I would use exists:

select b.*
from b
where exists (select 1 from a where a.id = b.id1) or
      exists (select 1 from a where a.id = b.id2);

In most databases, this would be the most efficient method for this type of logic. I'm not 100% sure that this is true in Hive, but it is definitely worth a try.

An alternative approach would be left joins:

select b.*
from b left join
     a a1
     on b.id1 = a1.id left join
     a a2
     on b.id2 = a2.id
where a1.id is not null or a2.id is not null;

This might have better performance in Hive, if the exists does not have good optimization.

Join two tables on multiple columns without duplicate rows

1 Answers