4
votes

We have a few Spark batch jobs and streaming jobs. The batch jobs run on a Google Cloud VM and the streaming jobs run on a Google Dataproc cluster. It is becoming difficult to manage the jobs, so we want to implement some mechanism to monitor their health. Our basic requirements are to know:

  1. What time the job started and how long it took to process the data.
  2. How many records were affected.
  3. Send an alert if there is any error.
  4. Visualize the above metrics every day and take action if required.

I am not well versed in the Spark domain. I explored Stackdriver Logging in Google Dataproc but did not find the logs for the streaming jobs on the Dataproc clusters. I know the ELK stack can be used, but I wanted to know the best practices in the Spark ecosystem for this kind of requirement. Thanks.


2 Answers

1
vote

Google Cloud Dataproc writes logs and pushes metrics to Google Stackdriver, which you can use for monitoring and alerting.

Take a look at documentation on how to use Dataproc with Stackdriver: https://cloud.google.com/dataproc/docs/guides/stackdriver-monitoring

0
votes

Adding to what Igor said.

There are metrics in Stackdriver for basic things like job success/failure and duration; however, there is nothing that covers #2 (record counts).

You can follow this example to create a SparkListener and then report the metrics to the Stackdriver API directly.
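For reference, here is a minimal sketch of such a listener, assuming the Spark 2.x+ Scala listener API. The class name `JobMetricsListener` and the `println` reporting are placeholders for whatever sink you choose (Cloud Monitoring API, structured logs scraped by an agent, etc.):

```scala
import org.apache.spark.scheduler._
import scala.collection.concurrent.TrieMap
import java.util.concurrent.atomic.AtomicLong

// Collects per-job duration and output record counts.
// Replace the println in onJobEnd with a push to Stackdriver or your alerting hook.
class JobMetricsListener extends SparkListener {

  // jobId -> start timestamp (ms)
  private val jobStartTimes = TrieMap.empty[Int, Long]
  // Records written by finished tasks (rough attribution across concurrent jobs)
  private val recordsWritten = new AtomicLong(0L)

  override def onJobStart(event: SparkListenerJobStart): Unit =
    jobStartTimes.put(event.jobId, event.time)

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for tasks that failed before reporting metrics
    Option(event.taskMetrics).foreach { m =>
      recordsWritten.addAndGet(m.outputMetrics.recordsWritten)
    }
  }

  override def onJobEnd(event: SparkListenerJobEnd): Unit = {
    val durationMs = jobStartTimes.remove(event.jobId).map(event.time - _).getOrElse(-1L)
    val succeeded  = event.jobResult == JobSucceeded
    // Placeholder reporting: swap in a Cloud Monitoring write or an alert on failure.
    println(s"job=${event.jobId} succeeded=$succeeded durationMs=$durationMs " +
            s"recordsWrittenSoFar=${recordsWritten.get}")
  }
}
```

You can register it programmatically with `spark.sparkContext.addSparkListener(new JobMetricsListener)`, or put the jar on the classpath and set `spark.extraListeners` so the listener is attached as soon as the SparkContext starts.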