0
votes

I have a job which consist around 9 sql statement to pull data from hive and write back to hive db. It is currently running for 3hrs which seems too long considering spark abitlity to process data. The application launchs total 11 stages.

I did some analysis using Spark UI and found below grey areas which can be improved:

  • Stage 8 in Job 5 has shuffle output of 1.5 TB.
  • Time gap between job 4 and job 5 is 20 Mins. I read about this time gap and found spark perform IO out of spark job which reflects as gap between two jobs which can be seen in driver logs.

We have a cluster of 800 nodes with restricted resources for each queue and I am using below conf to submit job:

  • -- num-executor 200 -- executor-core 1 -- executor-memory 6G -- deployment mode client

Attaching Image of UI as well.

Now my questions are:

  • Where can I find driver log for this job?
  • In image, I see a long list of Executor added which I sum is more than 200 but in Executor tab, number is exactly 200. enter image description hereAny explation for this?
  • Out of all the stages, only one stage has TASK around 35000 but rest of stages has 200 tasks only. Should I increase number of executor or should I go for dynamic allocation facility of spark?
1
Are these 9 separate SQL statements to same or N tables in HDFS?thebluephantom
First two SQL pull data from main tables to one common table. Rest of the query use common table to generate output as per filter condition.pandi

1 Answers

0
votes

Below are the thought processes that may guide you to some extent:

Is it necessary to have one core per executor? The executor need not be fat always. You can have more cores in one executor. it is a trade-off between creating a slim vs fat executors.

Configure shuffle partition parameter spark.sql.shuffle.partitions

Ensure while reading data from Hive, you are using Sparksession (basically HiveContext). This will pull the data into Spark memory from HDFS and schema information from Metastore of Hive.

Yes, Dynamic allocation of resources is a feature that helps in allocating the right set of resources. It is better than having fixed allocation.