2
votes

I am running Dataproc and submitting Spark jobs using the default client-mode. The logs for the jobs are visible in the GCP console and are available in the GCS bucket. However, I would like to see the logs in Stackdriver Logging.

Currently, the only way I found was to use cluster-mode instead.

Is there a way to push logs to Stackdriver when using client-mode?

1 Answer

5
votes

This is something the Dataproc team is actively working on and should have a solution for you sometime soon. If you want to file a public feature request to track this, that is an option, but I will also try to update this answer when the feature is available to you.

Digging into it a bit, the reason why you can see the logs when using cluster-mode is that we have Fluentd configurations that pick up YARN container logs (userlogs) by default. When running in cluster-mode the driver runs in a YARN container and those logs are picked up by that configuration.
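
In the meantime, the cluster-mode workaround you mentioned can be applied per job by setting the standard Spark property spark.submit.deployMode=cluster at submission time, so the driver runs inside a YARN container and its output flows through the Fluentd configuration described above. Below is a minimal sketch using the google-cloud-dataproc Python client; the project, region, cluster name, and example jar are placeholders rather than values from your setup.

    from google.cloud import dataproc_v1

    region = "us-central1"  # placeholder region
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "my-cluster"},  # placeholder cluster
        "spark_job": {
            "main_class": "org.apache.spark.examples.SparkPi",
            "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "args": ["1000"],
            # Run the driver in a YARN container so its logs are picked up
            # by the YARN userlogs Fluentd configuration.
            "properties": {"spark.submit.deployMode": "cluster"},
        },
    }

    operation = job_client.submit_job_as_operation(
        request={"project_id": "my-project", "region": region, "job": job}
    )
    operation.result()  # wait for the job to finish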

Currently, output produced by the driver is forwarded directly to GCS by the Dataproc agent. In the future there will be an option to have all driver output sent to Stackdriver when starting a cluster.

Update:

This feature is now in Beta and is stable enough to use. When creating a cluster, the property "dataproc:dataproc.logging.stackdriver.job.driver.enable" can be used to toggle whether the cluster sends job driver logs to Stackdriver. Additionally, you can use the property "dataproc:dataproc.logging.stackdriver.job.yarn.container.enable" to have the cluster associate YARN container logs with the job that created them instead of the cluster they ran on.
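
For reference, here is a minimal sketch of setting those properties at cluster-creation time with the google-cloud-dataproc Python client (project, region, and cluster name are placeholders); the same key=value pairs can also be passed through the --properties flag of gcloud dataproc clusters create.

    from google.cloud import dataproc_v1

    region = "us-central1"  # placeholder region
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",    # placeholder project
        "cluster_name": "my-cluster",  # placeholder cluster name
        "config": {
            "software_config": {
                "properties": {
                    # Send Spark job driver output to Stackdriver Logging.
                    "dataproc:dataproc.logging.stackdriver.job.driver.enable": "true",
                    # Associate YARN container logs with the job that produced them.
                    "dataproc:dataproc.logging.stackdriver.job.yarn.container.enable": "true",
                }
            }
        },
    }

    operation = cluster_client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready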

Documentation is available here.