
When df1 and df2 have the same rows, and neither df1 nor df2 contains duplicated values,
what is the complexity of the join operation df1.join(df2)?
My guess is that it takes O(n^2).

Also, is it possible to sort both data frames to get better performance? If not, what is the way to make a join faster in PySpark?


1 Answer


Even if df1 and df2 have the same set of rows, if they are not already partitioned on the join key, Spark has to shuffle (repartition) both data frames on that key before it can join them. From Spark 2.3 onwards, sort-merge join is the default join strategy: it requires both data frames to be partitioned and sorted by the join key, after which the join itself is a linear merge. Both data frames also have to be co-located (matching partitions on the same executors) for the sort-merge join. So the overall cost is dominated by the shuffle and sort, which is roughly O(n log n) rather than O(n^2).

Also, is it possible to sort both data frames to get better performance? If not, what is the way to make a join faster in PySpark?

Yes. If you see that a particular data frame is joined again and again on the same join key, you can repartition that data frame on the join key and cache it for further use. Please refer to the link below for more details:

https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/