1
votes

I'm trying to do a full otuer join between 5 tables. I am trying to do it using an "anchor table" and then performing a left outer join. Select all distinct key fields from each table then from those do left outer joins on each of the 5 tables. Doing this with HIVE. Here is what I have...

select
COALESCE(a.key,b.key,c.key,d.key,e.key,NULL) key,
a.field_1,
b.field_2,
c.field_3,
d.field_4,
e.field_5
from (
    select distinct key from table_1
    union
    select distinct key from table_2
    union
    select distinct key from table_3
    union
    select distinct key from table_4
    union
    select distinct key from table_5
) as keyvalues
    LEFT OUTER JOIN table_1 as a on a.key = keyvalues.key
    LEFT OUTER JOIN table_2 as b on b.key = keyvalues.key
    LEFT OUTER JOIN table_3 as c on c.key = keyvalues.key
    LEFT OUTER JOIN table_4 as d on d.key = keyvalues.key
    LEFT OUTER JOIN table_5 as e on e.key = keyvalues.key

The above code is generic,I have many more fields than that but that is the idea. What my results are showing are duplicate key values. That's because key is not necessarily a primary key across those tables. So my result i would like to have no duplicates. Almost if I could take a distinct of the "keyvalues" variable before I start doing those left outer joins.

2
union should do a distinct is it possible the same key appears multiple times in the a, b, c, d, or e tables? - Hogan
For some reason I am seeing duplicate key values in the result. Unless, of course, that's just from duplicates being in one or more of the tables themselves... - Brian Gurka
That is from the table being joined to - Hogan
The coalesce key part can just be keyvalues.key. - avery_larry

2 Answers

1
votes

I would suggest a very different approach, which I will illustrate for 3 tables:

select key,
       max(field1) as field1, max(field2) as field2, max(field3) as field3
from ((select key, field1, null as field2, null as field3,
              row_number() over (partition by key order by key) as seqnum
       from table1
      ) union all
      (select key, null as field1, field2, null as field3,
              row_number() over (partition by key order by key) as seqnum
       from table2
      ) union all
      (select key, null as field1, null as field2, field3,
              row_number() over (partition by key order by key) as seqnum
       from table3
      )
     ) t
group by key, seqnum;

This will create a "vertical list" for each field for each key.

1
votes

If you prefer joins then subqueries with row_number will do:

with 

table1 as (
select key, col1...coln
  from
      (select key, col1...coln,
              row_number() over(partition by key order by... ) rn
       from table_1)s
 where s.rn=1
),

table2 as (
select key, col1...coln
  from
      (select key, col1...coln,
              row_number() over(partition by key order by... ) rn
       from table_2)s
 where s.rn=1
),

table3 ... and so on ...

keyvalues as( --can use UNION ALL, because subqueries are already contain unique keys
select key from table1
union all 
select key from table2
union all
... and so on
)

select
COALESCE(a.key,b.key,c.key,d.key,e.key) key,
a.col1,
b.coln,
...
from keyvalues --join with de-duped sub-queries
    LEFT OUTER JOIN table1 as a on a.key = keyvalues.key
    LEFT OUTER JOIN table2 as b on b.key = keyvalues.key
    ...