Optimizing a Cross Join in Spark SQL

Question

Is it possible to optimize Cross Joins in Spark SQL ? The requirement is to populate a column band_id based on age-ranges defined in another table. So far I have been able to implement the same thru a Cross Join and WHERE clause. But, I was hoping if there is a better way to code this and alleviate performance issues. Can I use a broadcast hint ? (sql provided below)

Customer: (10 M records)

id | name | age
X1 | John | 22
V2 | Mark | 29
F4 | Peter| 42

Age_band table: (10 records)

band_id | low_age | high_age
B123    |  10     | 19
X745    |  20     | 29
P134    |  30     | 39
Q245    |  40     | 50

Expected Output:

id | name | age | band_id
X1 | John | 22  | X745
V2 | Mark | 29  | X745
F4 | Peter| 42  | Q245

Query:

select
from cust a
cross join age_band b
where a.age between b.low_age and b.high_age;

Please advise.

mazaneicha mazaneicha · Accepted Answer · 2020-07-31T18:31:50

From SparkStrategies.scala source, it seems like in your case you can but you don't have to specify either cross or broadcast hint, because Broadcast Nested Loop Join is what Spark will select regardless:

   * ...
   * - Broadcast nested loop join (BNLJ):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports all the join types, but the implementation is optimized for:
   *       1) broadcasting the left side in a right outer join;
   *       2) broadcasting the right side in a left outer, left semi, left anti or existence join;
   *       3) broadcasting either side in an inner-like join.
   *     For other cases, we need to scan the data multiple times, which can be rather slow. 
   * ...

Optimizing a Cross Join in Spark SQL

2 Answers