0
votes

I'm trying to do a left join in pyspark on two columns of which just one is named identical:

How could I drop both columns of the joined dataframe df2.date and df2.accountnr?

dfAll = (
    df1
    .join(df2, 
        [df1.order_date == df2.date, df1.accountnr== df2.accountnr], 
        how = 'left')
)

Adding .drop(df2.date, df2.accountnr) (error: col should be a string) or .drop('date', 'accountnr') (error: ambiguous column names) does unfortunately not work.

The only solution that I can come up with is first seperately renaming df2.accountnr to df2.accountnr2 and then use .drop('date', 'accountnr2')

Is there a better solution to directly drop the joining columns in pyspark?

1

1 Answers

0
votes

Utilize the column names directly as part of your join condition, this requires renaming a column on one of the DataFrames (I will choose df1 for this example). You also no longer have to drop anything afterwards:

dfAll = (
    df1
    .withColumnRenamed('order_date', 'date')
    .join(df2, 
        ['date', 'accountnr'], 
        how = 'left')
)