Join spark dataset by considering one dataset when column values match in Java

Question

I have 2 spark datasets as follows

Dataset A :

id, mid, reputation
1   2    airline
2   3    airline
4   7    car    

Dataset B:

id, mid, reputation
1   2    bus
5   8    airline

I want to join (or some operation) Dataset A with B with the below output

  id, mid, reputation
    1   2    bus
    2   3    airline
    4   7    car
    5   8    airline

where its like a full outer join where there is no match between id and mid columns in both dataset, but for the matching id and mid columns, i want to Dataset A to be considered in the output.

I tried passing full and left_inner to the .join() method, but it didnt work.

Note : how to join two DataFrame and replace one column conditionally in spark this does not solve the problem completely.

It does not give below line in the output, Basically when you join dataset A with B. If B has records not in A, then I want them in output as well. If A has records that B does not have , I want them in output and if there are matching records (id and mid) in A and B, then take value from dataset B

5 8 airline

werner werner · Accepted Answer · 2021-06-15T20:34:16

Using a full outer join and applying coalesce to all columns should work:

select 
    coalesce(tabA.id, tabB.id) as id, 
    coalesce(tabA.mid, tabB.mid) as mid, 
    coalesce(tabB.reputation, tabA.reputation) as reputation
from 
    tabA full outer join tabB on tabA.id=tabB.id and tabA.mid=tabB.mid

Join spark dataset by considering one dataset when column values match in Java

1 Answers