
I have table_1 and table_2 as below:

table_1

ID  name   qty
1   ball   34
2   pen    45
5   ham    22
4   van    1
9   phone  200

table_2 with changed_qty

ID  name   changed_qty
1   ball   70
5   ham    400
9   phone  89

I want to join table_1 and table_2 on the columns ['ID', 'name']. If an (ID, name) pair is not present in table_2, I want to keep the table_1 row as-is; if it is present, I want to take its changed_qty from table_2 as the new qty.

Expected results:

ID  name   qty
1   ball   70
2   pen    45
5   ham    400
4   van    1
9   phone  89

A normal left join doesn't give me the expected result:

df_final = df_table_1.join(df_table_2, ['ID', 'name'], how="left")
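
It leaves qty and changed_qty as two separate columns instead, with nulls where table_2 has no match (row order may vary):

df_final.show()

+---+-----+---+-----------+
| ID| name|qty|changed_qty|
+---+-----+---+-----------+
|  1| ball| 34|         70|
|  2|  pen| 45|       null|
|  5|  ham| 22|        400|
|  4|  van|  1|       null|
|  9|phone|200|         89|
+---+-----+---+-----------+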

1 Answer


Coalesce the two columns, i.e. take changed_qty where it is non-null and fall back to qty otherwise:

import pyspark.sql.functions as F

# coalesce() returns changed_qty when the join found a match, qty otherwise
final = df_table_1.join(df_table_2, ['ID', 'name'], 'left').select(
    'ID', 'name', F.coalesce('changed_qty', 'qty').alias('qty'))

final.show()
+---+-----+---+
| ID| name|qty|
+---+-----+---+
|  1| ball| 70|
|  2|  pen| 45|
|  5|  ham|400|
|  4|  van|  1|
|  9|phone| 89|
+---+-----+---+
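
For a fully self-contained run, something like the following should work (a minimal sketch; the createDataFrame calls simply recreate the sample tables from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample tables from the question.
df_table_1 = spark.createDataFrame(
    [(1, 'ball', 34), (2, 'pen', 45), (5, 'ham', 22), (4, 'van', 1), (9, 'phone', 200)],
    ['ID', 'name', 'qty'])
df_table_2 = spark.createDataFrame(
    [(1, 'ball', 70), (5, 'ham', 400), (9, 'phone', 89)],
    ['ID', 'name', 'changed_qty'])

# Left join keeps every table_1 row; coalesce prefers the updated quantity.
final = df_table_1.join(df_table_2, ['ID', 'name'], 'left').select(
    'ID', 'name', F.coalesce('changed_qty', 'qty').alias('qty'))
final.show()

An equivalent but wordier formulation is F.when(F.col('changed_qty').isNotNull(), F.col('changed_qty')).otherwise(F.col('qty')); coalesce is just shorthand for that null-handling pattern.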