I have a job that consists of around 9 SQL statements that pull data from Hive and write the results back to a Hive DB. It currently runs for about 3 hours, which seems too long given Spark's ability to process data. The application launches 11 stages in total.
I did some analysis using the Spark UI and found the following grey areas that could be improved:
- Stage 8 in Job 5 has a shuffle output of 1.5 TB.
- The time gap between Job 4 and Job 5 is 20 minutes. I read up on this gap and found that Spark performs some IO outside of Spark jobs, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue, and I am using the conf below to submit the job (a full spark-submit sketch is shown right after):
- `--num-executors 200 --executor-cores 1 --executor-memory 6G --deploy-mode client`
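This is roughly how the job is submitted today; the queue, class, and jar names below are placeholders, not the real ones:

```bash
# Current submission (fixed 200 executors); queue/class/jar names are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --queue etl_queue \
  --num-executors 200 \
  --executor-cores 1 \
  --executor-memory 6G \
  --class com.example.HiveEtlJob \
  hive-etl-job.jar
```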
I am attaching an image of the UI as well.
Now my questions are:
- Where can I find the driver log for this job?
- In the image, I see a long list of "Executor added" events that sum to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
- Out of all the stages, only one stage has around 35,000 tasks; the rest have only 200 tasks each. Should I increase the number of executors, or should I go for Spark's dynamic allocation facility (something like the sketch below)?
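If dynamic allocation is the way to go, my understanding is that the submission would look roughly like this. The min/max executor values are just guesses for our queue limits, the class/jar names are placeholders again, and I have not tested this on our cluster (I believe the external shuffle service also has to be enabled on the NodeManagers for this to work):

```bash
# Possible dynamic-allocation submission instead of a fixed --num-executors 200;
# min/max values and class/jar names are guesses/placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --queue etl_queue \
  --executor-cores 1 \
  --executor-memory 6G \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=50 \
  --conf spark.dynamicAllocation.maxExecutors=400 \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.HiveEtlJob \
  hive-etl-job.jar
```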