0
votes

I would like to join two DataFrames that have column names in common.

My DataFrames are as follows:

>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]


sample3
     uid1  count1
0  John         3
1  Paul         4
2  George       5

sample4
     uid1  count1
0  John         3
1  Paul         4
2  George       5

(I am using the same DataFrame with a different name on purpose)

I looked at Spark JIRA issue SPARK-7197, which addresses how to perform this join (it is inconsistent with the PySpark documentation). However, the method it proposes produces duplicate columns:

>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]

I would like to get a result where the keys do not appear twice.

I can do this with one column:

>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]

However, the same syntax does not work with the join-condition approach above; it throws an error.

I would like to get the result:

DataFrame[uid1: string, count1: bigint]

How can I achieve this?

1
Are your values integers that you want to sum? count1_sum = sample3_spark['count1'] + sample4_spark['count1']? - Andy Kubiak
No, I want to figure out how to perform a join. This is not an intermediate step to an end goal - Michal
Looks like it may have been addressed in April. There might be some example syntax you can use on line 471 of python/pyspark/sql/dataframe.py. - Andy Kubiak
Yes, but the documentation is incorrect. - Michal
It's in line 560 of the file, but doing that produces an error. - Michal

1 Answer

0
votes

You can define the join condition using a list of key column names; in your case:

sample3.join(sample4, ['uid1','count1'])