I would like to join two DataFrames that have column names in common.
My DataFrames are as follows:
>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]
sample3
uid1 count1
0 John 3
1 Paul 4
2 George 5
sample4
uid1 count1
0 John 3
1 Paul 4
2 George 5
(I am using the same DataFrame with a different name on purpose)
I looked at Spark JIRA issue 7197, which addresses how to perform this join (inconsistently with the PySpark documentation). However, the method proposed there produces duplicate columns:
>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]
I would like to get a result where the keys do not appear twice.
I can do this with one column:
>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]
However, the same syntax does not carry over to a join on two columns; it throws an error.
I would like to get the result:
DataFrame[uid1: string, count1: bigint]
I was also wondering how something like this would be possible:
count1_sum = sample3_spark['count1'] + sample4_spark['count1']? - Andy Kubiak