0
votes

I have a theta join in SAS which need to be translate into Hive.

SAS:

select a.id,b.name from employee a 
left outer join company b 
on ( a.id=b.id and a.joindate>=b.joindate and a.releasedate < b.releasedate)

Since this is not inner join, I am not getting proper results if I add non equi-join in the where condition (all non matched records from left table are missing).

Tried below in Hive:

select a.id,b.name from employee a 
left outer join company b 
on ( a.id=b.id) 
where a.joindate>=b.joindate and a.releasedate < b.releasedate

Any suggestions?

1

1 Answers

0
votes

Just as you may have realized, left join keeps all the items from the Preserved Row Table (employee), whereas your where filters out those items if a.joindate<b.joindate or a.releasedate >= b.releasedate.

Those on conditions are logically interpreted as:

  1. For each item li from the left table, whenever an item ri from the right table is found to meet the on conditions, make a new item ni whose columns values are the combination of li and ri. Then put ni in the result set. This may duplicate rows from the left table.
  2. If li fails to find a match, make one copy of it, fill the right columns with nulls. Put this item in the result set too.

So we can emulate this behavior by:

  1. Loosen the on conditions by keeping just equality conditions within the on clause;
  2. Filter out extra rows indroduced by the loosened on conditions, i.e., rows that fail to meet conditions other than equality conditions.

The resultant query may look like this:

select 
    a.id,
    b.name
from employee a 
left outer join company b 
on (a.id=b.id)
where (b.id is null 
       or (a.joindate>=b.joindate and a.releasedate<b.releasedate))