
I understand there are answers for accessing Spark Job Driver Output and Hadoop Jobs from a Dataproc cluster, along with Output from Dataproc Spark job in Google Cloud Logging. Thanks for these.

However, I am also interested in viewing the logs for incomplete Spark applications, such as an interactive pyspark or spark-shell session, both by:

  1. using the same web interfaces, and possibly
  2. accessing the raw session output (a log file on the local fs or HDFS?)

During a Spark shell session, I can see the session listed as an incomplete application, but the UI shows no information in the Jobs, Stages, or Tasks tabs while I execute commands in the REPL. This can easily be reproduced as follows:

# Launch Dataproc cluster
>> gcloud beta dataproc clusters create $DATAPROC_CLUSTER_NAME

# SSH to master node:
>> gcloud compute ssh "root@$DATAPROC_CLUSTER_NAME-m"

# Launch a Spark shell (e.g., Python) 
>> pyspark

I'm able to see the Spark session as an incomplete application (as noted above), and can execute a basic Spark job (with a collect action), like:

>>> rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
>>> rdd2 = rdd.map(lambda x: x + 1)
>>> rdd2.collect()
[2, 3, 4, 5, 6, 7]
>>> rdd2.persist()
PythonRDD[1] at collect at <stdin>:1

But this results in no information across any of the Jobs, Stages, or Storage tabs: see Spark Job History UI screen grab (blank).

To emphasize: when jobs are submitted via the Dataproc API, these tabs do show all the expected job history.

Any tips on where I can access such output / job history from the Spark shell session? Many thanks in advance. :)

1 Answer


Dataproc only provides driver output for Dataproc jobs, that is, drivers submitted through the API (usually via the Cloud SDK or the Developer Console). To run spark-shell, you have to SSH into the cluster and run the shell yourself, so it will not be tracked as a job. It is, however, still tracked in the web UIs, and you can capture the console output yourself.
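
For comparison, a minimal sketch of submitting the equivalent code through the jobs API so that it is tracked (the script name my_job.py is just a placeholder for a file containing the snippet from the question):

# Submit the script as a tracked Dataproc job from your workstation
>> gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME my_job.py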

  1. The Spark History Server only updates when the application finishes. For a live Spark Web UI:

    a. Go to the YARN ResourceManager's Web UI as documented here

    b. Find your application (it is probably at the top, RUNNING, and named PySparkShell)

    c. Click on ApplicationMaster in the final column under Tracking UI.

    d. You should see the live Spark Web UI for your application.

    In general, I would always recommend viewing Spark and MapReduce jobs through the ResourceManager's Web UI, because it links to both currently running applications and completed job histories. (See the sketch after this list for a quick way to find the tracking URL from the command line.)

  2. You can capture the output of the shell into a local log using something like spark-shell |& tee -a shell.log. If you only want log records (and not print statements), you could also use log4j to configure a local file log.
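
For reference, a rough sketch of both points from the master node. The yarn CLI call is standard Hadoop; the log4j.properties path below is an assumption about where the Spark conf lives on the Dataproc image, so adjust it for your setup:

# (1) List running YARN applications; the last column is the tracking URL
#     behind the ApplicationMaster link in the ResourceManager UI
>> yarn application -list -appStates RUNNING

# (2) Capture everything the shell prints into a local log file
>> pyspark |& tee -a shell.log

# (2, alternative) Append a file appender so log4j records also land on local disk
#     (path is an assumption; check where your Spark conf actually lives)
>> cat <<'EOF' >> /etc/spark/conf/log4j.properties
log4j.rootCategory=INFO, console, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark-shell.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF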