6 votes

I'm running a PySpark job in Google Cloud Dataproc, in a cluster where half the nodes are preemptible, and I'm seeing several errors in the job output (the driver output) such as:

...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ... Python worker exited unexpectedly (crashed)
   ...
Caused by java.io.EOFException
   ...

...YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a *lost* node

...spark.storage.BlockManagerMasterEndpoint: Error try to remove broadcast 3 from block manager BlockManagerId(...)

Perhaps by coincidence, the errors mostly seem to be coming from preemptible nodes.

My suspicion is that these opaque errors are coming from the nodes or executors running out of memory, but there don't seem to be any granular, memory-related metrics exposed by Dataproc.

How can I determine why a node was considered lost? Is there a way I can inspect memory usage per node or executor to validate whether these errors are being caused by high memory usage? If YARN is what's killing containers or marking nodes as lost, is there a way to introspect why?


2 Answers

1 vote

This is because you are using preemptible VMs, which are short-lived: Compute Engine can reclaim them at any time, and they run for at most 24 hours. When GCE shuts down a preemptible VM, you see errors like this:

YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a lost node
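
If you want to confirm that a particular worker was actually preempted (rather than lost for some other reason, such as memory pressure), you can list the project's preemption events from your own machine. A minimal sketch, assuming the gcloud SDK is installed and authenticated:

# List Compute Engine preemption operations for the project
gcloud compute operations list \
    --project=${PROJECT} \
    --filter="operationType=compute.instances.preempted"

Any instance that shows up here was reclaimed by GCE, so the corresponding "lost node" errors are expected rather than a sign of memory problems.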

0 votes

Open a secure shell from your machine to the cluster's master node. You'll need the gcloud SDK installed for that.

gcloud compute ssh ${HOSTNAME}-m --project=${PROJECT}

Then run the following commands on the cluster.

List all nodes in the cluster

yarn node -list 
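
By default yarn node -list only shows RUNNING nodes. To also see the nodes YARN has marked as LOST (which is what you're asking about), include the other states; a sketch, assuming a reasonably recent Hadoop version:

# Show nodes in all states, or only the LOST ones
yarn node -list -all
yarn node -list -states LOST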

Then use a ${NodeID} from that list to get a report on the node's state.

yarn node -status ${NodeID}
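
The node report includes, among other things, how much memory is used and available on that node, which helps with the per-node memory question. To see why a specific container died (for example, whether YARN killed it for exceeding its memory limit), you can also pull the application's aggregated logs and search them. A sketch, assuming log aggregation is enabled and ${APPLICATION_ID} is your Spark application's ID:

# Find the application ID, then search its logs for memory-related kills
yarn application -list
yarn logs -applicationId ${APPLICATION_ID} | grep -i -E 'killing container|physical memory|outofmemory'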

You could also set up local port forwarding over SSH to the YARN web UI instead of running commands directly on the cluster.

gcloud compute ssh ${HOSTNAME}-m \
    --project=${PROJECT} -- \
    -L 8088:${HOSTNAME}-m:8088 -N

Then go to http://localhost:8088/cluster/apps in your browser.
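
With the same tunnel open, you can also query the ResourceManager's REST API from your machine; the standard /ws/v1/cluster/nodes endpoint returns each node's state and memory usage as JSON:

# Per-node state and memory usage via the YARN ResourceManager REST API
curl http://localhost:8088/ws/v1/cluster/nodes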