Join two pyspark dataframes with combination of columns keeping only unique values

Question

I have two pyspark dataframes with multiple columns. I want to join the two dataframes on specific columns from each dataframe and need the resultant dataframe to have all values from df1 in addition to new unique rows from df2.

The uniqueness is to be decided on basis of the combination of the required columns, for example in the below scenario, combination taken is Col2:Col7

df1:
Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9    Col10
94159   New     Store   Puma    Odd     East    Blue    Button  Zero    Max
421301  New     Store   Lexi    Even    West    Red     Button  Zero    Max
226024  New     Online  Puma    Odd     East    Blue    Button  Zero    Max
560035  Old     Store   Puma    Odd     East    Black   Button  Zero    Max

df2:
Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9    Col10
New     Store   Puma    Odd     East    Blue    Button  Zero    Max
New     Store   Lexi    Even    West    Red     Button  Zero    Max
New     Stock   Puma    Odd     East    Blue    Button  Zero    Max
Old     Online  Puma    Odd     East    Black   Button  Zero    Max

resultant_dataframe:
Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9    Col10
94159   New     Store   Puma    Odd     East    Blue    Button  Zero    Max
421301  New     Store   Lexi    Even    West    Red     Button  Zero    Max
226024  New     Online  Puma    Odd     East    Blue    Button  Zero    Max
560035  Old     Store   Puma    Odd     East    Black   Button  Zero    Max
null    New     Stock   Puma    Odd     East    Blue    Button  Zero    Max
null    Old     Online  Puma    Odd     East    Black   Button  Zero    Max

I am using the below approach:

resultant_dataframe = df1.join(df2, ['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'], 'left')
resultant_dataframe = resultant_dataframe.dropDuplicates(subset=['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'])

But i am missing something since all the unique rows are not getting updated in the resultant dataframe. Is there any other way to get this done?

Luiz Viola Luiz Viola · Accepted Answer · 2021-02-01T14:01:13

If you do a left join, it will only return the rows from the df at left, in this case the df1.

you should change the join parameter to full:

resultant_dataframe = df1.join(df2, ['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'], 'full')

Join two pyspark dataframes with combination of columns keeping only unique values

1 Answers