I am trying to join two PySpark dataframes, as below, on the "Year" and "invoice" columns. But if "Year" is missing in df1, then I need to join on "invoice" alone.
df1:
Year  invoice  Status  Item
2020  262      YES     bag
2019  252      YES     ball
2018  240      YES     pen
2017  228      YES     ink
2016  216      NO      headphone
2015  213      NO      bicycle
      198      NO      ribbon
      175      YES     phone
df2:
Year  invoice
2020  262
2016  216
2014  175
2013  198
2019  252
Expected output:
Year  invoice  Status  Item
2020  262      YES     bag
2016  216      NO      headphone
2014  175      YES     phone
2013  198      NO      ribbon
2019  252      YES     ball
I am able to join df1 and df2 as below (based only on the "Year" and "invoice" columns). If "Year" is missing in df1, I need to add logic to join the two dataframes on "invoice" alone.
# Joining on a list of column names already deduplicates the join keys,
# so the explicit drop() calls are not needed.
df_results = df1.join(df2, on=['Year', 'invoice'], how='left')
Please let me know how to join when "Year" is not available in df1, so that the dataframes are joined on "invoice" alone in that case. Thanks.
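To make the intended matching rule concrete, here is a plain-Python sketch of the semantics I am after (the function name `fallback_join` and the dict-based rows are just illustrative, not PySpark): for each df2 row, a df1 row matches if the invoices are equal and either the years are equal or df1's Year is missing, with the output Year taken from df2.

```python
# Sample data mirroring the tables above; None marks a missing Year in df1.
df1 = [
    {"Year": 2020, "invoice": 262, "Status": "YES", "Item": "bag"},
    {"Year": 2019, "invoice": 252, "Status": "YES", "Item": "ball"},
    {"Year": 2018, "invoice": 240, "Status": "YES", "Item": "pen"},
    {"Year": 2017, "invoice": 228, "Status": "YES", "Item": "ink"},
    {"Year": 2016, "invoice": 216, "Status": "NO",  "Item": "headphone"},
    {"Year": 2015, "invoice": 213, "Status": "NO",  "Item": "bicycle"},
    {"Year": None, "invoice": 198, "Status": "NO",  "Item": "ribbon"},
    {"Year": None, "invoice": 175, "Status": "YES", "Item": "phone"},
]
df2 = [
    {"Year": 2020, "invoice": 262},
    {"Year": 2016, "invoice": 216},
    {"Year": 2014, "invoice": 175},
    {"Year": 2013, "invoice": 198},
    {"Year": 2019, "invoice": 252},
]

def fallback_join(left_rows, right_rows):
    """Match on (Year, invoice), falling back to invoice alone
    when the left row's Year is missing; Year comes from the right side."""
    out = []
    for right in right_rows:
        for left in left_rows:
            if left["invoice"] == right["invoice"] and (
                left["Year"] is None or left["Year"] == right["Year"]
            ):
                out.append({
                    "Year": right["Year"],          # keep df2's Year (e.g. 2014, 2013)
                    "invoice": right["invoice"],
                    "Status": left["Status"],
                    "Item": left["Item"],
                })
    return out

result = fallback_join(df1, df2)
```

If this is the right rule, I assume the PySpark equivalent would be a single join with a compound condition along the lines of `(df1.invoice == df2.invoice) & (df1.Year.isNull() | (df1.Year == df2.Year))`, then selecting df2's Year, rather than joining on the column-name list.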