
I want to join 2 Spark DataFrames that are already partitioned on a subset of the keys I use to join. But when I do, Exchange operations still occur anyway. How can I join them without an Exchange or a Broadcast?

For example, I have DataFrames df1 and df2. They both have the same columns: col1, col2, col3. And they have both already been partitioned using col1. I want to join them using col1 and col2. But when I do, they get repartitioned again using col1 and col2.

can you partition by both col1 and col2? – mck
@mck The data is already in the partition of col1, so I don't want to shuffle it. – Adisak Rungruang

1 Answer


AFAIK, the DataFrames need to be partitioned by the same column(s) on both sides, matching the join keys, for the join to get by with one shuffle.

Eg:

from pyspark.sql.functions import col

right_df = right_df.repartition(400)  # no column given: round-robin partitioning
left_df = left_df.repartition(400)    # will lead to one more shuffle when the join operation runs down the line
df = left_df.join(right_df, col("id") == col("user_id"), "outer")

[screenshot: query plan showing an extra Exchange before the join]

left_df = left_df.withColumnRenamed("id", "repartition_id").repartition(400, col("repartition_id"))
right_df = right_df.withColumnRenamed("user_id", "repartition_id").repartition(400, col("repartition_id"))
df = left_df.join(right_df, "repartition_id", "outer")

[screenshot: query plan reusing the existing partitioning, with no extra Exchange before the join]