how to combine two dataframe replacing null values

Question

I have two dataframe. The set of columns in them is slightly different df1:

+---+----+----+----+                                                            
| id|col1|col2|col3|
+---+----+----+----+
|  1|  15|  20|   8|
|  2|   0|null|   5|
+---+----+----+----+

df2:

+---+----+----+----+
| id|col1|col2|col4|
+---+----+----+----+
|  1|  10|  10|  40|
|  2|  10|  30|  50|
+---+----+----+----+

How can pyspark make a left join for df1? But at the same time replace null values with values from df2? And also adding the missing columns from df2

result_df:

id  col1 col2 col3 col4
 1    15   20    8   40
 2     0   30    5   50

I need to combine two data frames with id to get an extra column col4, and for col1, col2, col3, take values from df1, unless the value is non-zero, then replace it with the value from df2.

Shu Shu · Accepted Answer · 2020-05-20T17:24:30

Use coalesce function after the left join.

from pyspark.sql.functions import *
df1.show()
#+---+----+----+----+
#| id|col1|col2|col3|
#+---+----+----+----+
#|  1|  15|  20|   8|
#|  2|   0|null|   5|
#+---+----+----+----+

df2.show()
#+---+----+----+----+----+
#| id|col1|col2|col3|col4|
#+---+----+----+----+----+
#|  1|  15|  20|   8|  40|
#|  2|   0|  30|   5|  50|
#+---+----+----+----+----+

df1.join(df2,["id"],"left").\
select("id",coalesce(df2.col1,df1.col1).alias("col1"),coalesce(df2.col2,df1.col2).alias("col2"),coalesce(df2.col3,df1.col3).alias("col3"),df2.col4).\
show()
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  1|  15|  20|   8|  40|
|  2|   0|  30|   5|  50|
+---+----+----+----+----+

how to combine two dataframe replacing null values

1 Answers