
I want to join 2 Spark DataFrames that are already partitioned on a subset of the keys I use to join. But when I do, Exchange operations still occur anyway. How can I join them without an Exchange or a Broadcast?

For example, I have DataFrames df1 and df2. They both have the same columns: col1, col2, col3. And they have both already been partitioned using col1. I want to join them using col1 and col2. But when I do, they get repartitioned again using col1 and col2.

can you partition by both col1 and col2? – mck
@mck The data is already in the partition of col1, so I don't want to shuffle it. – Adisak Rungruang

1 Answer


AFAIK, the DataFrames need to be partitioned by the same column(s) on both sides, matching the join keys, for the join to get by with one shuffle.

Eg:

from pyspark.sql.functions import col

right_df = right_df.repartition(400)  # no column given: round-robin partitioning
left_df = left_df.repartition(400)    # will lead to one more shuffle when the join operation runs down the line
df = left_df.join(right_df, col("id") == col("user_id"), "outer")

[screenshot: query plan showing an extra Exchange before the join]

left_df = left_df.withColumnRenamed("id", "repartition_id").repartition(400, col("repartition_id"))
right_df = right_df.withColumnRenamed("user_id", "repartition_id").repartition(400, col("repartition_id"))
df = left_df.join(right_df, "repartition_id", "outer")

[screenshot: query plan reusing the existing partitioning, with no extra Exchange before the join]