
I recently started using PySpark. I have a two-column DataFrame with one column containing some nulls, e.g.

df1
A             B
1a3b          7
0d4s          12
6w2r          null
6w2r          null
1p4e          null

and another DataFrame that has the correct mapping, i.e.

df2
A             B
1a3b          7
0d4s          12
6w2r          0
1p4e          3

So I want to fill in the nulls in df1 using df2, such that the result is:

A             B
1a3b          7
0d4s          12
6w2r          0
6w2r          0
1p4e          3

In pandas, I would first create a lookup dictionary from df2 and then use apply on df1 to populate the nulls. But I'm not sure which functions to use in PySpark; most of the null replacement I have seen is based on simple conditions, for example filling all the nulls in a certain column with a single constant value.
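For reference, this is roughly what I mean in pandas (a sketch, with the example data written out by hand):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['1a3b', '0d4s', '6w2r', '6w2r', '1p4e'],
                    'B': [7, 12, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({'A': ['1a3b', '0d4s', '6w2r', '1p4e'],
                    'B': [7, 12, 0, 3]})

# lookup dictionary built from df2
lookup = dict(zip(df2['A'], df2['B']))

# fill each null B from the lookup, keyed on A
df1['B'] = df1.apply(
    lambda row: lookup[row['A']] if pd.isna(row['B']) else row['B'],
    axis=1)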

What I have tried is:

from pyspark.sql.functions import when, col

df1.withColumn('B', when(df1.B.isNull(), df2.where(df2.B == df1.B).select('A')))

but this raises AttributeError: 'DataFrame' object has no attribute '_get_object_id'. The logic is to first filter out the nulls and then replace them with column B's value from df2, but I think df1.B.isNull() evaluates the whole column instead of a single value, which is probably not the right way to do it. Any suggestions?


1 Answer


A left join on the common column A, followed by selecting the appropriate columns, should get you your desired output:

df1.join(df2, df1.A == df2.A, 'left').select(df1.A, df2.B).show(truncate=False)

which should give you

+----+---+
|A   |B  |
+----+---+
|6w2r|0  |
|6w2r|0  |
|1a3b|7  |
|1p4e|3  |
|0d4s|12 |
+----+---+
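Note that selecting df2.B unconditionally assumes every key in df1 has a match in df2. If df2 might be missing some keys, a variant using coalesce keeps df1's already-populated values and only falls back to df2 (a sketch, assuming the same two DataFrames; unmatched rows simply keep a null B):

from pyspark.sql import functions as F

# keep df1's B where it is already populated; otherwise take df2's B
result = (df1.alias('l')
             .join(df2.alias('r'), on='A', how='left')
             .select('A', F.coalesce(F.col('l.B'), F.col('r.B')).alias('B')))
result.show(truncate=False)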