0
votes

I am trying to join two pyspark dataframes like this

joined = df.join(df1,on=["date"],how='left').select([col('df.'+xx) for xx in df.columns] + [col('df1.daily_net_payment_sum'),col('df1.daily_net_payment_avg')])

But it results in

An error was encountered:
"cannot resolve '`df.cust_no`' given input columns: 

It seems to me that I am unable to reference columns by their DataFrame/table name. I'm using Spark 2.4.7.

Any ideas are appreciated.

1
Can you post the schema of the two DFs, please? Or the entire stack trace? Or the show statements on both DFs? This question doesn't tell us anything. – Mohd Avais

1 Answer

1
vote

You can achieve this by first creating an alias for each DataFrame, then using that alias as the column qualifier in the select.

df = df.alias("df")
df1 = df1.alias("df1")
joined = df.join(df1,on=["date"],how='left').select([col('df.'+xx) for xx in df.columns] + [col('df1.daily_net_payment_sum'),col('df1.daily_net_payment_avg')])
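For a self-contained illustration, here is a minimal sketch that reproduces the fix end to end. The sample rows are made up; only date, cust_no, and the two payment columns mirror names that appear in the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample data for illustration only.
df = spark.createDataFrame(
    [("2021-01-01", "C001"), ("2021-01-02", "C002")],
    ["date", "cust_no"],
).alias("df")
df1 = spark.createDataFrame(
    [("2021-01-01", 100.0, 50.0)],
    ["date", "daily_net_payment_sum", "daily_net_payment_avg"],
).alias("df1")

joined = df.join(df1, on=["date"], how="left").select(
    [col("df." + c) for c in df.columns]
    + [col("df1.daily_net_payment_sum"), col("df1.daily_net_payment_avg")]
)
joined.show()

Without the alias() calls, the joined plan has no relation named df or df1, so col('df.cust_no') cannot be resolved; the alias supplies the qualifier that the select expects.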