I've only seen sources say that pre-filtering before a join helps for RDDs, so I was wondering whether it also helps for DataFrames, since Spark Core and Spark SQL optimize differently.
Let's say table1 has 6 million records and we're joining it to table2, which has 600 million records, on table2's primary key, key2.
Suppose we plan to do:

    table3 = table1.join(table2, 'key2', 'left_outer')
Is it worth filtering table2's 600 million records down with a WHERE table2.key2 IN (SELECT key2 FROM table1) before the join? And if so, what's the best way to do it? I know the DataFrame LEFT SEMI JOIN is similar to a WHERE ... IN filter, but I'd like to know if there are better ways to filter it down.
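For concreteness, here's a minimal PySpark sketch of the two approaches I'm comparing (the spark.table reads are placeholders; assume both tables are registered in the catalog):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("prefilter-join").getOrCreate()

    # Placeholder reads; substitute however the tables are actually loaded.
    table1 = spark.table("table1")   # ~6 million rows
    table2 = spark.table("table2")   # ~600 million rows, primary key 'key2'

    # Plain left outer join, leaving the planning to Catalyst.
    table3 = table1.join(table2, 'key2', 'left_outer')

    # Pre-filtered variant: the LEFT SEMI JOIN plays the role of
    # WHERE table2.key2 IN (SELECT key2 FROM table1), shrinking table2
    # to only the keys present in table1 before the outer join.
    table2_filtered = table2.join(table1.select('key2'), 'key2', 'left_semi')
    table3_prefiltered = table1.join(table2_filtered, 'key2', 'left_outer')

Calling .explain() on each result would show the physical plans, but I'm not sure how to tell from them which will actually be cheaper at this scale.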