Whenever I set up a Google Dataproc cluster with Stackdriver monitoring and the monitoring agent, Stackdriver loses its connection as soon as Dataproc gets a job. The Stackdriver UI shows a latency value that they say should not be higher than 2 minutes in most cases. For me, this value is simply the time since I submitted the job (often hours), and the only metrics available are ones that can already be seen on the Compute Engine page.
Is there a way to get Stackdriver monitoring working with Dataproc? I would like to be able to monitor the RAM usage of my jobs if possible.
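To make the RAM question concrete, here is a minimal sketch of the kind of number I am after, computed directly on a node while the agent metrics are missing. It assumes a Linux kernel that exposes MemTotal and MemAvailable in /proc/meminfo; mem_used_pct is a hypothetical helper name, not part of any Google or Stackdriver tooling, and it reports node-level memory rather than per-job usage.

```shell
#!/bin/bash
# Hypothetical helper: print the percentage of RAM in use on this node.
# Reads a meminfo-format file (defaults to /proc/meminfo) and assumes it
# contains MemTotal and MemAvailable lines, both in kB.
# If MemAvailable is missing, the result will read as 100.0.
mem_used_pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", (total - avail) * 100 / total
      else exit 1
    }
  ' "$meminfo"
}
```

Calling mem_used_pct with no arguments on the master or a worker prints the current value; passing a file path makes it easy to check against a saved copy of /proc/meminfo.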
Stackdriver monitoring is run and set up by my organization, but they seem to have access to all of the features. We do not use an HTTP proxy. The monitoring agent is installed using the commands found in Google's documentation. I have a startup script (passed with the --initialization-actions flag) that runs on both the master and the workers and looks like this:
#!/bin/bash
cd /tmp
curl -O "https://repo.stackdriver.com/stack-install.sh"
bash stack-install.sh --write-gcm
# ... other initialization stuffs
EDIT: The "other initialization stuffs" is just a couple of gsutil copy commands to get some resource files onto the local machines if that makes a difference.
I have tried moving the agent install to after the other commands. I only use /tmp because Google recommends using absolute paths when copying files (I forget where this is documented, but it has helped me before).
Here's a screenshot, as requested, of what I'm seeing in Stackdriver. Notice how all of the metrics other than CPU usage stop at the vertical line. That is when the job was submitted to Spark today:
Results of grep stackdriver-agent /var/log/syslog:
Sep 2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: Starting Stackdriver metrics collection agent: stackdriver-agentoption = Hostname; value = 3431688934917455875;
Sep 2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: option = Interval; value = 60.000000;
Sep 2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: Created new plugin context.
Sep 2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: option = PIDFile; value = /var/run/stackdriver-agent.pid;
Sep 2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: option = Hostname; value = 3431688934917455875;
Sep 2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: option = Interval; value = 60.000000;
Sep 2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: Created new plugin context.
Sep 2 13:31:53 <cluster-name>-m stackdriver-agent[3609]: .
Sep 2 13:31:56 <cluster-name>-m stackdriver-agent[3823]: Stopping Stackdriver metrics collection agent: stackdriver-agent.
Sep 2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: Starting Stackdriver metrics collection agent: stackdriver-agentoption = Interval; value = 60.000000;
Sep 2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: Created new plugin context.
Sep 2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: option = PIDFile; value = /var/run/stackdriver-agent.pid;
Sep 2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: option = Interval; value = 60.000000;
Sep 2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: Created new plugin context.
Sep 2 13:31:56 <cluster-name>-m stackdriver-agent[3842]: .
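The excerpt above shows the agent starting, then stopping and starting again three seconds later. A small sketch for summarizing that from a longer log, so restarts around job submission are easy to spot; agent_events is a hypothetical helper name, and the patterns are taken from the init-script messages in the log lines above.

```shell
#!/bin/bash
# Hypothetical helper: count agent start/stop events in syslog lines
# read from stdin. The match strings come from the log excerpt above.
agent_events() {
  awk '
    /Starting Stackdriver metrics collection agent/ { starts++ }
    /Stopping Stackdriver metrics collection agent/ { stops++ }
    END { printf "starts=%d stops=%d\n", starts, stops }
  '
}
```

It would be used as grep stackdriver-agent /var/log/syslog | agent_events. For the excerpt above this reports two starts and one stop, so the agent was restarted once during initialization and nothing in the log shows it stopping afterwards.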
EDIT: Full cluster creation command is:
gcloud dataproc clusters create <cluster-name> --master-machine-type n1-highmem-2 --worker-machine-type n1-highmem-2 --initialization-actions <path-to-script> --master-boot-disk-size 50GB --worker-boot-disk-size 50GB --num-workers 16 --network internal --zone us-east1-c --scopes https://www.googleapis.com/auth/cloud-platform --project <project-name> --tags dataproc
The dataproc tag opens up the firewall on all ports in my organization. The internal network was found to work better than the default one.
Results of sudo systemctl | grep stackdriver-agent:
stackdriver-agent.service active running LSB: start and stop Stackdriver Agent
Results of sudo ps wwaux | grep stackdriver-agent:
root 3851 0.0 0.0 1004704 9096 ? Ssl 12:50 0:00 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /opt/stackdriver/collectd/etc/collectd.conf -P /var/run/stackdriver-agent.pid
7053 0.0 0.0 12732 2068 pts/0 S+ 13:14 0:00 grep stackdriver-agent

grep stackdriver-agent /var/log/syslog? - Dennis Huo
sudo systemctl and/or sudo ps wwaux on both master and worker nodes? - Dennis Huo