I use Spark SQL 2.4. We use series of chained Spark temporary views to perform the data transformations.
So, many a times, I run into scenarios where I need to apply a CROSS JOIN
between a large table and other small tables. The small lookup tables/views barely has 1-10 records. However, I still run into performance issues nevertheless where the main table has (say) 10M
records, while the small table has only 1 record.
My question is how to ameliorate performance of the query in such scenarios. Is there any way to speed-up the queries. I have tried the BROADCAST
hint on the lookup tables but didn't get the expected performance improvement compared to without the hint.
Can anyone please share their experience and expertise on the best practices for these cases especially when we do a cross join
with another small table based on ranges (Scenario I) below.
SCENARIO I: Cross Join based on range
create temporary view lkp_tbl as
select agegrpid, lowage, highage
from small_tbl;
select /*+ MAPJOIN(b) */
a.id, a.name, a.age, b.agegrpid
from big_tbl a
inner join lkp_tbl b
WHERE a.age between b.lowage and b.highage; -- cross join based on range
SCENARIO II:
create temporary view big_tbl as
select id, name, age
from employee
where id = '1001';
create temporary view lkp_tbl as -- contains only 1 record
select max(last_prcs_dt) as prcs_dt
from run_table;
select /*+ MAPJOIN(b) */
a.id, a.name, a.age, b.prcs_dt
from big_tbl a
cross join lkp_tbl;
Thanks. Any help is appreciated.