2
votes

I have two large PySpark dataframes, df1 and df2, containing GBs of data. The first dataframe has columns id1 and col1; the second has columns id2 and col2. The dataframes have an equal number of rows, all values of id1 and id2 are unique, and each value of id1 corresponds to exactly one value of id2.

For example, the first few entries of df1 and df2 are as follows.

df1:

 id1 |  col1
  12 |  john
  23 |  chris
  35 |  david

df2:

id2  |   col2
 23  |  lewis
 35  |  boon
 12  |  cena

So I need to join the two dataframes on the keys id1 and id2:

df = df1.join(df2, df1.id1 == df2.id2)

I am afraid this may suffer from shuffling. How can I optimize the join operation for this special case?


1 Answer

1
votes

To avoid shuffling at join time, repartition the data on your id column beforehand.

The repartition itself performs a full shuffle, but it pays off if you run more than one join on the same key, since subsequent joins can reuse that partitioning.

df1 = df1.repartition('id1')
df2 = df2.repartition('id2')
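
A minimal sketch of this repartition-then-join approach, using the sample rows from the question; the partition count of 200 and the cache() call are illustrative assumptions, not requirements:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-join").getOrCreate()

# Sample data from the question; in practice df1 and df2 come from your GB-scale sources.
df1 = spark.createDataFrame([(12, "john"), (23, "chris"), (35, "david")], ["id1", "col1"])
df2 = spark.createDataFrame([(23, "lewis"), (35, "boon"), (12, "cena")], ["id2", "col2"])

# Hash-partition both sides on their join keys with the same partition count,
# and cache so the shuffle cost is paid only once if several joins follow.
df1 = df1.repartition(200, "id1").cache()
df2 = df2.repartition(200, "id2").cache()

df = df1.join(df2, df1.id1 == df2.id2)
df.show()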

Another way to avoid shuffles at join is to leverage bucketing.

Save both dataframes as tables using the bucketBy clause on the id column; when you later read them back, rows with the same id land in the same buckets, so the join can be planned without a shuffle (see the sketch at the end of this answer).

To get the benefit of bucketing, however, you need a Hive metastore, since that is where the bucketing metadata is stored.

This also adds the extra steps of writing the bucketed tables and then reading them back.
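
A minimal sketch of the bucketing approach, assuming Hive support is enabled; the table names bucketed_df1 / bucketed_df2 and the bucket count of 100 are illustrative choices:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bucketed-join")
         .enableHiveSupport()  # bucketing metadata lives in the Hive metastore
         .getOrCreate())

# Write both dataframes as bucketed (and sorted) tables on their join keys.
df1.write.bucketBy(100, "id1").sortBy("id1").mode("overwrite").saveAsTable("bucketed_df1")
df2.write.bucketBy(100, "id2").sortBy("id2").mode("overwrite").saveAsTable("bucketed_df2")

# Read the bucketed tables back; matching bucket counts let Spark join them without a shuffle.
b1 = spark.table("bucketed_df1")
b2 = spark.table("bucketed_df2")
df = b1.join(b2, b1.id1 == b2.id2)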