
I've only seen sources say that this helps for RDDs, so I was wondering whether it helps for DataFrames, since Spark Core and the Spark SQL engine optimize differently.

Let's say table1 has 6 million records and we're joining it to table2, which has 600 million records. We are joining these two tables on table2's primary key, 'key2'.

If we plan to do:

    table3 = table1.join(table2, 'key2', 'left_outer')

Is it worth filtering table2's 600 million records down with a WHERE table2.key2 IN (SELECT key2 FROM table1) before the join? And if so, what's the best way to do it? I know the DataFrame LEFT SEMI JOIN method is similar to a WHERE IN filter, but I'd like to know if there are better ways to filter it down.
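For concreteness, here is roughly what I have in mind, as a sketch using the names above (the left_semi join type standing in for the WHERE IN filter):

    # Keep only table2 rows whose key2 appears in table1 (semi join = WHERE IN).
    table2_filtered = table2.join(table1.select('key2'), 'key2', 'left_semi')

    # The original join, now against the reduced right-hand side.
    table3 = table1.join(table2_filtered, 'key2', 'left_outer')

Since the final join is a left outer from table1, any table2 rows without a match in table1 would be dropped anyway, so the pre-filter shouldn't change the result.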


1 Answer


TL;DR It is not possible to answer without data, but probably not.

Pre-filtering may provide a performance boost if it significantly reduces the number of records to be shuffled. For that, two things have to hold:

  • The filter has to be highly selective.
  • The size of the key column has to be << the size of all the columns.

The first one is obvious: if there is no reduction, you shuffle for nothing. The second is subtle - WHERE ... IN (SELECT ... FROM ...) requires a shuffle, the same as a join, so the keys are actually shuffled twice.
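You can check this in the physical plans. A quick sketch, assuming an existing SparkSession named spark and the question's table names (with tables this size, expect a SortMergeJoin LeftSemi fed by an Exchange, i.e. a shuffle, on both inputs, unless one side falls under spark.sql.autoBroadcastJoinThreshold):

    table1.createOrReplaceTempView('t1')
    table2.createOrReplaceTempView('t2')

    # The IN subquery...
    spark.sql("SELECT * FROM t2 WHERE key2 IN (SELECT key2 FROM t1)").explain()

    # ...compiles to the same left semi join as the explicit DataFrame version.
    table2.join(table1.select('key2'), 'key2', 'left_semi').explain()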

Using Bloom filters might scale more gracefully (no need to shuffle).
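PySpark doesn't expose a DataFrame-level Bloom filter (the Scala API has df.stat.bloomFilter), so here is a hand-rolled sketch of the idea under two assumptions: the names are the ones from the question, and table1's ~6 million distinct keys can be streamed through the driver while building the filter (at a 1% false-positive rate the bit array is roughly 7 MB):

    import hashlib
    import math

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # Size the filter for table1's key count at a 1% false-positive rate.
    n, fpp = table1.count(), 0.01
    m = int(-n * math.log(fpp) / math.log(2) ** 2)  # bits in the filter
    k = max(1, int(m / n * math.log(2)))            # number of hash functions

    def positions(key, m, k):
        # k bit positions via double hashing on a SHA-256 digest.
        d = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(d[:8], 'big')
        h2 = int.from_bytes(d[8:16], 'big')
        return [(h1 + i * h2) % m for i in range(k)]

    # Build the bit array on the driver from table1's keys.
    bits = bytearray((m + 7) // 8)
    for row in table1.select('key2').distinct().toLocalIterator():
        for p in positions(row.key2, m, k):
            bits[p // 8] |= 1 << (p % 8)

    bf = spark.sparkContext.broadcast((bytes(bits), m, k))

    @udf(BooleanType())
    def might_match(key):
        b, m, k = bf.value
        return all(b[p // 8] & (1 << (p % 8)) for p in positions(key, m, k))

    # Each partition of table2 is filtered locally - no shuffle.
    table2_filtered = table2.filter(might_match('key2'))

False positives just let a few extra table2 rows through; since they have no match in table1, the subsequent left outer join drops them anyway, so the result is still correct.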

If you have a 100-fold difference in the number of records, it might be better to consider a broadcast join.
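A sketch of that, again with the question's names. One caveat worth verifying on your Spark version: for a LEFT OUTER join the broadcast hint only helps on the right-hand side, so with a 600-million-row table2 the natural place for the hint here is the semi-join pre-filter:

    from pyspark.sql.functions import broadcast

    # Ship table1's keys (small) to every executor; the semi join then
    # filters table2 partition-locally, with no shuffle of the 600M rows.
    table2_filtered = table2.join(
        broadcast(table1.select('key2').distinct()), 'key2', 'left_semi')

    table3 = table1.join(table2_filtered, 'key2', 'left_outer')

The final join still shuffles, but only the filtered remainder of table2 rather than all 600 million rows.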