Dataframe Join Null-Safe Condition Use

Question

I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:

dfA:

# +----+----+
# |col1|col2|
# +----+----+
# |   a|null|
# |   b|   0|
# |   c|   0|
# +----+----+

dfB:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# +----+----+----+

The dataframes are creatable with this script:

dfA = spark.createDataFrame(
    [
        ('a', None),
        ('b', '0'),
        ('c', '0')
    ],
    ('col1', 'col2')
)

dfB = spark.createDataFrame(
    [
        ('a', None, 'x'),
        ('b', '0', 'x')
    ],
    ('col1', 'col2', 'col3')
)

Join call:

dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()

Result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|null|  <- col3 should be x
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

Expected result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|  <-
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

It works if I set the first row, col2 to anything other than null, but I need to support null values.

I tried using a condition to compare using null-safe equals as outlined in this post like so:

cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()

Result of null-safe join:

# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# |   a|null|   a|null|   x|
# |   b|   0|   b|   0|   x|
# |   c|   0|null|null|null|
# +----+----+----+----+----+

This retains duplicate columns though, I'm still looking for a way to achieve the expected result at the end of a join.

Shaido Shaido · Accepted Answer · 2018-06-14T02:06:18

A simple solution would be to select the columns that you want to keep. This will let you specify which source dataframe they should come from as well as avoid the duplicate column issue.

dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()

Dataframe Join Null-Safe Condition Use

3 Answers