0
votes

I am trying to join two pyspark dataframes like this

joined = df.join(df1,on=["date"],how='left').select([col('df.'+xx) for xx in df.columns] + [col('df1.daily_net_payment_sum'),col('df1.daily_net_payment_avg')])

But it results in

An error was encountered:
"cannot resolve '`df.cust_no`' given input columns: 

It seems to me that I am unable to reference columns by their DataFrame/table name. I'm using Spark 2.4.7.

Any ideas are appreciated.

1
Can you post the schema of the two DFs, please? Or the entire stack trace? Or the show statements on both DFs? This question doesn't tell us anything. – Mohd Avais

1 Answer

1
vote

You can achieve this by first creating an alias for each DataFrame, then using that alias as the column qualifier in the select.

df = df.alias("df")
df1 = df1.alias("df1")
joined = df.join(df1,on=["date"],how='left').select([col('df.'+xx) for xx in df.columns] + [col('df1.daily_net_payment_sum'),col('df1.daily_net_payment_avg')])
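For a self-contained illustration, here is a minimal sketch that reproduces the fix end to end. The sample rows are made up; only date, cust_no, and the two payment columns mirror names that appear in the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample data for illustration only.
df = spark.createDataFrame(
    [("2021-01-01", "C001"), ("2021-01-02", "C002")],
    ["date", "cust_no"],
).alias("df")
df1 = spark.createDataFrame(
    [("2021-01-01", 100.0, 50.0)],
    ["date", "daily_net_payment_sum", "daily_net_payment_avg"],
).alias("df1")

joined = df.join(df1, on=["date"], how="left").select(
    [col("df." + c) for c in df.columns]
    + [col("df1.daily_net_payment_sum"), col("df1.daily_net_payment_avg")]
)
joined.show()

Without the alias() calls, the joined plan has no relation named df or df1, so col('df.cust_no') cannot be resolved; the alias supplies the qualifier that the select expects.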