1
votes

Whenever I set up a Google Dataproc cluster with Stackdriver monitoring and the monitoring agent, I have noticed that Stackdriver loses its connection as soon as Dataproc receives a job. The Stackdriver UI shows a latency value that is documented as not being higher than 2 minutes in most cases; for me, that value is simply the time since I submitted the job (often hours), and the only metrics available are the ones already visible on the Compute Engine page.

Is there a way to get Stackdriver monitoring working with Dataproc? I would like to be able to monitor the RAM usage of my jobs if possible.

Stackdriver monitoring is set up and run by my organization, but we appear to have access to all of its features. We do not use an HTTP proxy. The monitoring agent is set up using the commands found in Google's documentation. I have an initialization script (passed via the --initialization-actions flag) that runs on both the master and the workers and looks like this:

#!/bin/bash
cd /tmp
# Download and run the Stackdriver agent installer; --write-gcm sends metrics to Google Cloud Monitoring
curl -O "https://repo.stackdriver.com/stack-install.sh"
bash stack-install.sh --write-gcm
# ... other initialization stuffs

EDIT: The "other initialization stuffs" is just a couple of gsutil copy commands to get some resource files onto the local machines if that makes a difference.

I have tried moving the agent install to after the other commands. I only use /tmp because Google recommends using absolute paths when copying files (I forget where this is documented, but it has helped me before).
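For reference, those copy commands look roughly like this (the bucket and file names below are placeholders, not the real ones):

# hypothetical example -- the real bucket and file names differ
gsutil cp gs://<my-bucket>/resources/some-model.bin /tmp/some-model.bin
gsutil cp gs://<my-bucket>/resources/some-config.json /tmp/some-config.json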

Here's a screenshot, as requested, of what I'm seeing in Stackdriver. Notice how all of the metrics other than CPU usage stop at the vertical line; that is when the job was submitted to Spark today:

Stackdriver screenshot

Results of grep stackdriver-agent /var/log/syslog:

Sep  2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: Starting Stackdriver metrics collection agent: stackdriver-agentoption = Hostname; value = 3431688934917455875;
Sep  2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: option = Interval; value = 60.000000;
Sep  2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: Created new plugin context.
Sep  2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: option = PIDFile; value = /var/run/stackdriver-agent.pid;
Sep  2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: option = Hostname; value = 3431688934917455875;
Sep  2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: option = Interval; value = 60.000000;
Sep  2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: Created new plugin context.
Sep  2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: .
Sep  2 13:31:56 <cluster-name>-m stackdriver-agent[3823]: Stopping Stackdriver metrics collection agent: stackdriver-agent.
Sep  2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: Starting Stackdriver metrics collection agent: stackdriver-agentoption = Interval; value = 60.000000;
Sep  2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: Created new plugin context.
Sep  2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: option = PIDFile; value = /var/run/stackdriver-agent.pid;
Sep  2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: option = Interval; value = 60.000000;
Sep  2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: Created new plugin context.
Sep  2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: .

EDIT: Full cluster creation command is:

gcloud dataproc clusters create <cluster-name> \
    --master-machine-type n1-highmem-2 \
    --worker-machine-type n1-highmem-2 \
    --initialization-actions <path-to-script> \
    --master-boot-disk-size 50GB \
    --worker-boot-disk-size 50GB \
    --num-workers 16 \
    --network internal \
    --zone us-east1-c \
    --scopes https://www.googleapis.com/auth/cloud-platform \
    --project <project-name> \
    --tags dataproc

The dataproc tag opens up the firewall on all ports in my organization. The internal network was found to work better than default.

Results of sudo systemctl | grep stackdriver-agent:

stackdriver-agent.service      active running   LSB: start and stop Stackdriver Agent

Results of sudo ps wwaux | grep stackdriver-agent:

root      3851  0.0  0.0 1004704 9096 ?        Ssl  12:50   0:00 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /opt/stackdriver/collectd/etc/collectd.conf -P /var/run/stackdriver-agent.pid
7053  0.0  0.0  12732  2068 pts/0    S+   13:14   0:00 grep stackdriver-agent
How are you setting up Stackdriver monitoring and your monitoring agent? Any chance you can share a screenshot of what you're seeing in the UI? - Dennis Huo
Edited the original post, let me know if you need any more information - jbird
If you happen to have a cluster up still where you did this, can you check grep stackdriver-agent /var/log/syslog? - Dennis Huo
@DennisHuo added the log. It looks like it stops it, then starts again. Not sure what is happening there but the only scripts relating to stackdriver are what is posted above in the start-up script - jbird
Thanks. The stop/start looks normal as far as I can tell. Do you also see stackdriver-agent running if you run sudo systemctl and/or sudo ps wwaux on both master and worker nodes? - Dennis Huo

1 Answer

0
votes

I repro'd some of your steps, and though I can't say why the monitoring might appear to "work" until you submit a job, this was the first thing I ran into when applying the instructions without debugging the internals of Dataproc: you should verify that you're granting the right scopes to your Dataproc cluster so that the stackdriver-agent can write its metrics into the API. Namely, the following seemed to work for me, keeping the init action the same:

gcloud dataproc clusters create dhuo-stackdriver \
    --initialization-actions gs://<my-bucket>/install_stackdriver.sh \
    --scopes https://www.googleapis.com/auth/monitoring.write

Alternatively, you can use other scopes listed in the Stackdriver documentation, such as the broader cloud-platform scope. Note that specifying --scopes may override some default scope mixins, which are normally added only when no user-specified scopes are given (an example of combining them follows the list below):

https://www.googleapis.com/auth/bigquery
https://www.googleapis.com/auth/bigtable.admin.table
https://www.googleapis.com/auth/bigtable.data
https://www.googleapis.com/auth/devstorage.full_control
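
If you still want those default scopes in addition to monitoring, one option (just a sketch, not something I verified in your project) is to list them yourself as a comma-separated value:

gcloud dataproc clusters create <cluster-name> \
    --initialization-actions gs://<my-bucket>/install_stackdriver.sh \
    --scopes https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/bigtable.admin.table,https://www.googleapis.com/auth/bigtable.data,https://www.googleapis.com/auth/devstorage.full_control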

My local test with just your snippet as an init action:

#!/bin/bash
cd /tmp
curl -O "https://repo.stackdriver.com/stack-install.sh"
bash stack-install.sh --write-gcm

plus the https://www.googleapis.com/auth/monitoring.write scope worked in my test project, including through job submission:

Stackdriver page for Dataproc cluster
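
As a side note, if you want to confirm which scopes the cluster's VMs actually received, you can SSH into a node and ask the standard Compute Engine metadata server (generic GCE behavior, nothing Dataproc-specific):

# Lists the OAuth scopes granted to the VM's default service account
curl -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"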