I am trying to join two PySpark dataframes, as below, on the "Year" and "invoice" columns. But if "Year" is missing in df1, then I need to join on "invoice" alone.
df1:
Year  invoice  Status  Item
2020  262      YES     bag
2019  252      YES     ball
2018  240      YES     pen
2017  228      YES     ink
2016  216      NO      headphone
2015  213      NO      bicycle
      198      NO      ribbon
      175      YES     phone
df2:
Year  invoice
2020  262
2016  216
2014  175
2013  198
2019  252
Expected output:
Year  invoice  Status  Item
2020  262      YES     bag
2016  216      NO      headphone
2014  175      YES     phone
2013  198      NO      ribbon
2019  252      YES     ball
I am able to join df1 and df2 as below (based only on the "Year" and "invoice" columns). If "Year" is missing in df1, I need to add logic to join the two dataframes on "invoice" alone.
# Joining on a list of column names already deduplicates the join keys,
# so the explicit drop() calls are not needed.
df_results = df1.join(df2, on=['Year', 'invoice'], how='left')
Please let me know how to join when "Year" is not available in df1, so that the dataframes are joined on "invoice" alone in that case. Thanks.
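To make the intended matching rule concrete, here is a plain-Python sketch of the semantics I am after (the function name `fallback_join` and the dict-based rows are just illustrative, not PySpark): for each df2 row, a df1 row matches if the invoices are equal and either the years are equal or df1's Year is missing, with the output Year taken from df2.

```python
# Sample data mirroring the tables above; None marks a missing Year in df1.
df1 = [
    {"Year": 2020, "invoice": 262, "Status": "YES", "Item": "bag"},
    {"Year": 2019, "invoice": 252, "Status": "YES", "Item": "ball"},
    {"Year": 2018, "invoice": 240, "Status": "YES", "Item": "pen"},
    {"Year": 2017, "invoice": 228, "Status": "YES", "Item": "ink"},
    {"Year": 2016, "invoice": 216, "Status": "NO",  "Item": "headphone"},
    {"Year": 2015, "invoice": 213, "Status": "NO",  "Item": "bicycle"},
    {"Year": None, "invoice": 198, "Status": "NO",  "Item": "ribbon"},
    {"Year": None, "invoice": 175, "Status": "YES", "Item": "phone"},
]
df2 = [
    {"Year": 2020, "invoice": 262},
    {"Year": 2016, "invoice": 216},
    {"Year": 2014, "invoice": 175},
    {"Year": 2013, "invoice": 198},
    {"Year": 2019, "invoice": 252},
]

def fallback_join(left_rows, right_rows):
    """Match on (Year, invoice), falling back to invoice alone
    when the left row's Year is missing; Year comes from the right side."""
    out = []
    for right in right_rows:
        for left in left_rows:
            if left["invoice"] == right["invoice"] and (
                left["Year"] is None or left["Year"] == right["Year"]
            ):
                out.append({
                    "Year": right["Year"],          # keep df2's Year (e.g. 2014, 2013)
                    "invoice": right["invoice"],
                    "Status": left["Status"],
                    "Item": left["Item"],
                })
    return out

result = fallback_join(df1, df2)
```

If this is the right rule, I assume the PySpark equivalent would be a single join with a compound condition along the lines of `(df1.invoice == df2.invoice) & (df1.Year.isNull() | (df1.Year == df2.Year))`, then selecting df2's Year, rather than joining on the column-name list.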