I have two PySpark dataframes with multiple columns. I want to join them on specific columns and need the resultant dataframe to contain all rows from df1, plus any rows from df2 that do not already exist in df1.
Uniqueness is decided by the combination of the join columns; in the scenario below, the combination is Col2 through Col7.
df1:
Col1    Col2  Col3    Col4  Col5  Col6  Col7   Col8    Col9  Col10
94159   New   Store   Puma  Odd   East  Blue   Button  Zero  Max
421301  New   Store   Lexi  Even  West  Red    Button  Zero  Max
226024  New   Online  Puma  Odd   East  Blue   Button  Zero  Max
560035  Old   Store   Puma  Odd   East  Black  Button  Zero  Max
df2:
Col2  Col3    Col4  Col5  Col6  Col7   Col8    Col9  Col10
New   Store   Puma  Odd   East  Blue   Button  Zero  Max
New   Store   Lexi  Even  West  Red    Button  Zero  Max
New   Stock   Puma  Odd   East  Blue   Button  Zero  Max
Old   Online  Puma  Odd   East  Black  Button  Zero  Max
resultant_dataframe:
Col1    Col2  Col3    Col4  Col5  Col6  Col7   Col8    Col9  Col10
94159   New   Store   Puma  Odd   East  Blue   Button  Zero  Max
421301  New   Store   Lexi  Even  West  Red    Button  Zero  Max
226024  New   Online  Puma  Odd   East  Blue   Button  Zero  Max
560035  Old   Store   Puma  Odd   East  Black  Button  Zero  Max
null    New   Stock   Puma  Odd   East  Blue   Button  Zero  Max
null    Old   Online  Puma  Odd   East  Black  Button  Zero  Max
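For reference, here is a minimal sketch of how the two frames can be reproduced (assuming a local SparkSession; the values are taken from the tables above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1 carries the identifier column Col1; df2 does not
cols1 = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7', 'Col8', 'Col9', 'Col10']
df1 = spark.createDataFrame([
    (94159,  'New', 'Store',  'Puma', 'Odd',  'East', 'Blue',  'Button', 'Zero', 'Max'),
    (421301, 'New', 'Store',  'Lexi', 'Even', 'West', 'Red',   'Button', 'Zero', 'Max'),
    (226024, 'New', 'Online', 'Puma', 'Odd',  'East', 'Blue',  'Button', 'Zero', 'Max'),
    (560035, 'Old', 'Store',  'Puma', 'Odd',  'East', 'Black', 'Button', 'Zero', 'Max'),
], cols1)
df2 = spark.createDataFrame([
    ('New', 'Store',  'Puma', 'Odd',  'East', 'Blue',  'Button', 'Zero', 'Max'),
    ('New', 'Store',  'Lexi', 'Even', 'West', 'Red',   'Button', 'Zero', 'Max'),
    ('New', 'Stock',  'Puma', 'Odd',  'East', 'Blue',  'Button', 'Zero', 'Max'),
    ('Old', 'Online', 'Puma', 'Odd',  'East', 'Black', 'Button', 'Zero', 'Max'),
], cols1[1:])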
I am using the below approach:
resultant_dataframe = df1.join(df2, ['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'], 'left')
resultant_dataframe = resultant_dataframe.dropDuplicates(subset=['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'])
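To confirm which rows are getting lost, I also ran a left anti join as a quick diagnostic (same key columns); it returns exactly the two df2-only rows, so their key combinations really are absent from df1:

key_cols = ['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7']
# rows of df2 whose key combination has no match in df1
df2.join(df1, key_cols, 'left_anti').show()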
But I am missing something: the rows that exist only in df2 (the last two rows of the expected output) never make it into the resultant dataframe, presumably because the left join keeps only df1's rows. Is there another way to get this done?