I have two large PySpark dataframes, df1 and df2, each containing GBs of data. The columns in the first dataframe are id1 and col1; the columns in the second are id2 and col2. The dataframes have an equal number of rows, all values of id1 and id2 are unique, and every value of id1 corresponds to exactly one value of id2.
The first few entries of df1 and df2 are as follows:
df1:
id1 | col1
12 | john
23 | chris
35 | david
df2:
id2 | col2
23 | lewis
35 | boon
12 | cena
So I need to join the two dataframes on the keys id1 and id2:

df = df1.join(df2, df1.id1 == df2.id2)

I am afraid this join may suffer from heavy shuffling. How can I optimize the join operation for this special case?