0
votes

I have two tables:

Table A:

  • id

Table B:

  • id1
  • id2
  • ... (lots of other columns)

I want to get all the rows in B where either B.id1 or B.id2 is A.id (all A.id are distinct. This isn't the case for B).

I've tried the query:

create table C as
select B.*
from B 
join A on (A.id = B.id1 or A.id = B.id2);

This works great except it duplicates rows where id1 and id2 are both in id. A is a relatively small table (around 5000 rows) whereas B is quite large (around 1 billion rows).

Solutions that I've thought of involve grouping by the columns of B so it eliminates distinct rows, collect the ids of A into an array and searching for id1 and id2 in the array, and using 2 separate queries which search for id1 and id2 respectively and then union the two tables. These all seem quite inefficient especially considering the size of the tables. Is there a better way of doing this that I'm missing? Thanks.

1
What's wrong with using select distinct? - sgeddes
It would be too slow considering the size of my data. - TheSlimyDog

1 Answers

3
votes

I would use exists:

select b.*
from b
where exists (select 1 from a where a.id = b.id1) or
      exists (select 1 from a where a.id = b.id2);

In most databases, this would be the most efficient method for this type of logic. I'm not 100% sure that this is true in Hive, but it is definitely worth a try.

An alternative approach would be left joins:

select b.*
from b left join
     a a1
     on b.id1 = a1.id left join
     a a2
     on b.id2 = a2.id
where a1.id is not null or a2.id is not null;

This might have better performance in Hive, if the exists does not have good optimization.