I understand there are answers for accessing Spark Job Driver Output and Hadoop Jobs from a Dataproc cluster, along with Output from Dataproc Spark job in Google Cloud Logging. Thanks for these.
However, I am also interested in viewing logs for incomplete Spark applications, such as interactive pyspark-shell or spark-shell sessions, both by:
- using the same web interfaces, and possibly also
- accessing the raw session output (a log file on the local filesystem or HDFS?) (see the sketch just below)
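To make that second bullet concrete, here is a minimal sketch of what I mean by the raw session output. I'm assuming the standard spark.eventLog.enabled / spark.eventLog.dir properties are what control whether and where an event log file for the session gets written, and that I can check them from inside the shell:
# From within the pyspark session: is event logging enabled, and where
# would the event log file land (local fs, HDFS, GCS)?
>>> sc.getConf().get("spark.eventLog.enabled", "false")
>>> sc.getConf().get("spark.eventLog.dir", "<not set>")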
During a Spark shell session, while I can see the session listed as an incomplete application, the UI shows no information under the Jobs, Stages, and Tasks tabs for the commands I execute in the REPL. This is easy to reproduce:
# Launch Dataproc cluster
>> gcloud beta dataproc clusters create $DATAPROC_CLUSTER_NAME
# SSH to master node:
>> gcloud compute ssh "root@$DATAPROC_CLUSTER_NAME-m"
# Launch a Spark shell (e.g., Python)
>> pyspark
I'm able to see the Spark session as an incomplete application (as noted above), and can execute a basic Spark job (with a collect action), like:
>>> rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
>>> rdd2 = rdd.map(lambda x: x + 1)
>>> rdd2.collect()
[2, 3, 4, 5, 6, 7]
>>> rdd2.persist()
PythonRDD[1] at collect at <stdin>:1
But this results in no information across any of the Jobs, Stages, or Storage tabs: see Spark Job History UI screen grab (blank).
To emphasize: when I submit jobs via the Dataproc API, these tabs do show all the expected job history.
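For reference, in case it helps, this is how I'm identifying the shell session so I can match it against the incomplete-applications list (assuming sc.applicationId is available in this Spark version; it is a standard SparkContext attribute):
# Application ID of this shell session, to match against the entry
# under "Incomplete applications" in the Spark history server UI:
>>> sc.applicationId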
Any tips on where I can access such output / job history from the Spark shell session? Many thanks in advance. :)