Iterative Broadcast Join: if the smaller table is too large to broadcast in one go, it might be worth considering the approach of iteratively taking slices of your smaller (but not that small) table, broadcasting those, joining with the larger table, then unioning the results.
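To make the slice, broadcast, join, union loop concrete, here is a minimal plain-Python sketch. It is not Spark code: the function name `iterative_broadcast_join`, the `num_chunks` parameter and the sample tables are assumptions made up for illustration, and a dict stands in for the broadcast copy of each slice.

```python
from collections import defaultdict

def iterative_broadcast_join(large, medium, num_chunks):
    """Inner-join two lists of (key, value) rows, one slice of `medium` at a time."""
    results = []
    for i in range(num_chunks):
        # Slice i of the medium table: small enough to "broadcast"
        # (simulated here by holding it in an in-memory dict).
        lookup = defaultdict(list)
        for key, value in medium[i::num_chunks]:
            lookup[key].append(value)
        # Join the full large table against this slice only.
        partial = [(k, lv, mv) for k, lv in large for mv in lookup.get(k, [])]
        # Union the partial result into the final output.
        results.extend(partial)
    return results

# Hypothetical sample data.
large = [("x", 1), ("x", 2), ("y", 3), ("z", 4)]
medium = [("x", "a"), ("y", "b"), ("z", "c"), ("x", "d")]
joined = iterative_broadcast_join(large, medium, num_chunks=2)
```

Every row of the medium table lands in exactly one slice, so each matching pair is produced exactly once and the union equals the ordinary inner join.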
To solve this, there is a concept called:
i) Salting technique: we add a random number to the key so that the data is distributed evenly across the cluster. Let's see this through an example below.
In the image above, suppose we are joining a large table with a small table. The data is divided among three executors, X, Y and Z, and the partial results are unioned afterwards. Because the data is skewed, all X rows end up on one executor, all Y rows on another, and all Z rows on a third.
Since the Y and Z data are relatively small, those executors finish quickly and then wait for the X executor, which takes much longer.
So, to improve performance, we should spread the X executor's data evenly across all executors.
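The cost of that waiting can be shown with a little arithmetic. The row counts below are invented for illustration, and "cost" is simply taken to be the number of rows an executor has to process.

```python
# Hypothetical row counts per join key; X is the skewed ("hot") key.
rows_per_key = {"X": 900, "Y": 60, "Z": 40}

# With one key per executor, the stage finishes only when the
# largest partition is done, so its cost is the maximum.
stage_cost_skewed = max(rows_per_key.values())

# If the same rows were spread perfectly evenly over the three
# executors, each would process about a third of the total.
total_rows = sum(rows_per_key.values())
stage_cost_even = total_rows / len(rows_per_key)
```

With these numbers the skewed stage costs 900 rows on the slowest executor, against roughly 333 for an even split.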
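Since the data is stuck on one executor, we add a number (a "salt") to every key in both the large and the small table and then run the join. In the large table each row gets one random salt; in the small table each row is repeated once per possible salt value, so that every salted key in the large table still finds its match.
Adding the salt: key = explode(key, range(1,3)), with the range taken as 1 to 3 inclusive, which gives key_1, key_2, key_3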
As you can see, the data is now evenly distributed across the executors, so the job runs faster.
If you need more help, please see this video:
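The whole salting procedure can be sketched in plain Python as follows. This is a simulation rather than Spark code: `NUM_SALTS`, the helper names and the sample tables are assumptions for illustration. The large table gets one random salt per row, while the small table is exploded across all salt values.

```python
import random
from collections import Counter

NUM_SALTS = 3  # assumed number of salt values; tune to the degree of skew

def salt_large(rows, num_salts, rng):
    # Large table: append ONE random salt to each row's key.
    return [(f"{key}_{rng.randint(1, num_salts)}", value) for key, value in rows]

def explode_small(rows, num_salts):
    # Small table: repeat each row once per salt value (key_1 .. key_N).
    return [(f"{key}_{s}", value) for key, value in rows
            for s in range(1, num_salts + 1)]

def join(left, right):
    # Plain inner hash join on the first element of each row.
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]

rng = random.Random(0)  # fixed seed so the run is repeatable
large = ([("X", i) for i in range(900)]
         + [("Y", i) for i in range(60)]
         + [("Z", i) for i in range(40)])
small = [("X", "x-info"), ("Y", "y-info"), ("Z", "z-info")]

salted = join(salt_large(large, NUM_SALTS, rng), explode_small(small, NUM_SALTS))
# Strip the salt suffix to recover the original key.
result = [(key.rsplit("_", 1)[0], lv, rv) for key, lv, rv in salted]
# Rows per salted X key: the hot key is now split into several smaller groups.
x_sizes = Counter(key for key, _, _ in salted if key.startswith("X_"))
```

The joined rows are the same as for the unsalted join, but the 900 X rows are now split over three keys of roughly 300 rows each. In a real cluster the salted keys are still assigned to partitions by hashing, so an even spread across executors is likely but not guaranteed.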
https://www.youtube.com/watch?v=d41_X78ojCg&ab_channel=TechIsland
and this link:
https://dzone.com/articles/improving-the-performance-of-your-spark-job-on-ske#:~:text=Iterative%20(Chunked)%20Broadcast%20Join,table%2C%20then%20unioning%20the%20result.