1
votes

I recently came across this talk about dealing with Skew in Spark SQL by using "Iterative" Broadcast Joins to improve query performance when joining a large table with another not so small table. The talk advises to tackle such scenarios using "Iterative Broadcast Joins". Unfortunately, the talk doesn't probe deep enough for me to understand its implementation.

Hence, I was hoping if someone could please shed some light on how to implement this Iterative Broadcast Join in Spark SQL with few examples. How do I implement the same using Spark SQL queries with the SQL-API ?

Note: I am using Spark SQL query 2.4

Any help is appreciated. Thanks

1

1 Answers

1
votes

Iterative Broadcast Join : large it might be worth considering the approach of iteratively taking slices of your smaller (but not that small) table, broadcasting those, joining with the larger table, then unioning the result.

To Solve this there is concept called:

i) Salting Technique: : In this we add a random number to a key and make data evenly distributed across clusters .Let see this through a example as below

enter image description here

In the above image Suppose we are performing a join on large and small table, data then is divided into three executors x,y and z as below and later union and since we have data skews all X will be in one executor and Y in another executor and z in another executor. Since Y and Z data is relatively small it will get completed and wait for X-executor to complete which will take time. SO to improve performance we should get X-executor data, evenly distributed across all executors Since the data is stuck on one executor we will add a random number to all key (to both large and small table) and execute our process

Adding a random number : Key =explode(key, range(1,3)), which will give key_1,key_2,key_3

enter image description here

Now if you see is evenly distributed across executors, hence provides faster performance

If you need more help,please see this video :

https://www.youtube.com/watch?v=d41_X78ojCg&ab_channel=TechIsland

and this link: https://dzone.com/articles/improving-the-performance-of-your-spark-job-on-ske#:~:text=Iterative%20(Chunked)%20Broadcast%20Join,table%2C%20then%20unioning%20the%20result.