Cross Join in Spark SQL

Question

I use Spark SQL 2.4. We use series of chained Spark temporary views to perform the data transformations.

So, many a times, I run into scenarios where I need to apply a CROSS JOIN between a large table and other small tables. The small lookup tables/views barely has 1-10 records. However, I still run into performance issues nevertheless where the main table has (say) 10M records, while the small table has only 1 record.

My question is how to ameliorate performance of the query in such scenarios. Is there any way to speed-up the queries. I have tried the BROADCAST hint on the lookup tables but didn't get the expected performance improvement compared to without the hint.

Can anyone please share their experience and expertise on the best practices for these cases especially when we do a cross join with another small table based on ranges (Scenario I) below.

SCENARIO I: Cross Join based on range

create temporary view lkp_tbl as
select agegrpid, lowage, highage
from small_tbl;

select /*+ MAPJOIN(b) */ 
a.id, a.name, a.age, b.agegrpid
from big_tbl a
inner join lkp_tbl b 
WHERE a.age between b.lowage and b.highage; -- cross join based on range

SCENARIO II:

create temporary view big_tbl as
select id, name, age
from employee
where id = '1001';

create temporary view lkp_tbl as -- contains only 1 record
select max(last_prcs_dt) as prcs_dt
from run_table; 

select /*+ MAPJOIN(b) */
a.id, a.name, a.age, b.prcs_dt
from big_tbl a
cross join lkp_tbl;

Thanks. Any help is appreciated.

marie20 marie20 · Accepted Answer · 2020-08-06T06:39:16

Broadcasting seems to be working fine.

The following discussion seems to shed some light on the topic -

Cross Join for calculation in Spark SQL

Cross Join in Spark SQL

1 Answers