I'm running a PySpark job on Google Cloud Dataproc, on a cluster where half the nodes are preemptible, and I'm seeing several errors in the job output (the driver output), such as:
...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ... Python worker exited unexpectedly (crashed)
...
Caused by java.io.EOFException
...
...YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a *lost* node
...spark.storage.BlockManagerMasterEndpoint: Error trying to remove broadcast 3 from block manager BlockManagerId(...)
Perhaps by coincidence, the errors mostly seem to be coming from preemptible nodes.
My suspicion is that these opaque errors come from the nodes or executors running out of memory, but Dataproc doesn't seem to expose any granular memory-related metrics.
How can I determine why a node was considered lost? Is there a way to inspect memory usage per node or per executor, so I can check whether these errors are caused by high memory usage? And if YARN is what is killing containers or deciding that nodes are lost, is there a way to find out why it did so?
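
For reference, here is a minimal sketch of how the executor-loss lines above could be triaged by exit status. It assumes each "remove executor" message sits on a single line of the driver output, as in the excerpt above. The names `summarize_executor_losses` and `describe` are made up for illustration. The code table is a subset of YARN's `ContainerExitStatus` constants: -103 and -104 would indicate a container killed for exceeding its virtual or physical memory limit, while -100 only says the container was aborted (here, released because its node was lost) and does not by itself say why the node went away.

```python
import re
from collections import Counter

# Subset of YARN's ContainerExitStatus codes; the descriptions are paraphrased.
YARN_EXIT_STATUS = {
    -100: "ABORTED (container released, e.g. because its node was lost)",
    -101: "DISKS_FAILED",
    -102: "PREEMPTED (by the YARN scheduler)",
    -103: "KILLED_EXCEEDED_VMEM (virtual memory limit exceeded)",
    -104: "KILLED_EXCEEDED_PMEM (physical memory limit exceeded)",
}

# Matches driver lines such as:
#   "... Requesting driver to remove executor 177 for reason Container
#    marked as failed: ... Exit status: -100. Diagnostics: ..."
# Assumption: the whole message is on one line.
EXECUTOR_LOSS = re.compile(
    r"remove executor (?P<executor>\d+) for reason .*?"
    r"Exit status: (?P<status>-?\d+)"
)


def summarize_executor_losses(lines):
    """Count executor removals in the driver output, keyed by exit status."""
    counts = Counter()
    for line in lines:
        match = EXECUTOR_LOSS.search(line)
        if match:
            counts[int(match.group("status"))] += 1
    return counts


def describe(status):
    """Return a short description for a YARN container exit status."""
    return YARN_EXIT_STATUS.get(status, "unrecognized exit status")
```

Feeding the saved driver output through `summarize_executor_losses` and printing `describe(status)` for each key would show whether any removals carry the memory-limit codes or whether they are all -100.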