
I am executing a Spark (SQL) job which has lots of stages (~150). It is written primarily in Spark SQL within an internal framework that chains the SQL statements together using temporary views and dataframes. For the initial intermediate table writes, I can see a detailed view in Spark UI -> SQL tab. But for the later table writes, the SQL tab just shows a view of the form below.

What is the reason for this and can I use any parameter to get a detailed graphical view in the SQL tab?

My spark version: 2.3

EDIT (17 Jan 2020): I found a JIRA, https://issues.apache.org/jira/browse/SPARK-30064, but I am not sure if it's related, since it mentions the JDBC data source, which I am not using.

[screenshot: condensed SQL tab view with no detailed plan graph]

Maybe you are writing to a JDBC DataSource? - mazaneicha

1 Answer


Check out https://spark.apache.org/docs/2.3.4/configuration.html#spark-ui. Specifically, I suspect that for this issue you may have spark.ui.retainedStages (default 1000) and/or spark.ui.retainedTasks (default 100k) set too low.

If your job has 150 stages and, for example, each stage has 1000 tasks on average, then the whole job would have 150 * 1000 = 150k tasks, which exceeds the default 100k limit. So the older tasks and stages would no longer appear in the Spark UI.
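You can raise these limits when submitting the job. A hedged sketch of such an invocation (the property names come from the Spark 2.3 configuration page linked above; the values and the job file name are illustrative, and higher retention means the driver keeps more UI bookkeeping in memory):

```shell
# Example only: raise UI retention limits so a ~150k-task job stays visible.
spark-submit \
  --conf spark.ui.retainedStages=2000 \
  --conf spark.ui.retainedTasks=300000 \
  --conf spark.ui.retainedJobs=2000 \
  your_job.py   # hypothetical job file
```

For the SQL tab specifically, spark.sql.ui.retainedExecutions (default 1000) may also be worth checking, since it caps how many SQL executions the UI keeps.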

PS. For Spark jobs with such a large number of stages (e.g., when many dataframes are chained together iteratively), we often find that checkpointing helps a lot. For example, you could checkpoint every 20-50 iterations (if there is a loop creating that huge lineage; experiment to find the number that works best for your case), which essentially splits the single 150-stage job into chunks of 20-50 stages. The Spark optimizer may have a hard time producing an optimal plan from a DAG built over 150 dataframes.

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-checkpointing.html