I have two DataFrames in PySpark. I am checking the data in DataFrame A and, wherever a column is null, replacing it with the value of the same column from DataFrame B.
Both DataFrames have a unique ID column on which I join them, and the code below works fine.
import pyspark.sql.functions as f

updated_data = TABLE_BY_updated_date_unique.select(
    'name_id_forwarded', 'name_id', 'name_id_org',
    'first', 'last', 'passport', 'PHONE', 'EMAIL')
most_attributes_data = Most_attributes.select(
    'name_id_forwarded', 'name_id', 'name_id_org',
    'first', 'last', 'passport', 'PHONE', 'EMAIL')

final_df = (updated_data.alias('a')
    .join(most_attributes_data.alias('b'), on=['name_id_forwarded'], how='left')
    .select(
        'a.name_id_forwarded', 'a.name_id', 'a.name_id_org',
        f.when(f.isnull(f.col('a.first')), f.col('b.first')).otherwise(f.col('a.first')).alias('first'),
        f.when(f.isnull(f.col('a.last')), f.col('b.last')).otherwise(f.col('a.last')).alias('last'),
        f.when(f.isnull(f.col('a.passport')), f.col('b.passport')).otherwise(f.col('a.passport')).alias('passport'),
        f.when(f.isnull(f.col('a.PHONE')), f.col('b.PHONE')).otherwise(f.col('a.PHONE')).alias('PHONE'),
        f.when(f.isnull(f.col('a.EMAIL')), f.col('b.EMAIL')).otherwise(f.col('a.EMAIL')).alias('EMAIL')
    ))
I have more than 40 columns, and I don't want to repeat this expression for each one: f.when(f.isnull(f.col('a.EMAIL')), f.col('b.EMAIL')).otherwise(f.col('a.EMAIL')).alias('EMAIL')
Can you please help me loop this syntax so that all the columns are handled without repetition?