I have a system that accepts jobs from users. These jobs run as Spark jobs on Dataproc. During the day there are a lot of jobs running, but at night there may not be any. I'm wondering what the best way is to terminate the cluster during these periods of downtime and either restart or re-create a cluster once a new job is received. The goal is to not be charged during periods of inactivity.
3 Answers
Dataproc now natively supports scheduled cluster deletion. You can schedule clusters to be deleted at a particular time (e.g. 7pm), or after they have been idle for a given period (e.g. 1h).
You can also check out cluster autoscaling.
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion
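For example, you can set an idle timeout at creation time (a minimal sketch; the cluster name and region here are placeholders):

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --max-idle=1h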
It sounds like you're looking for the --max-idle option, which can be set during cluster creation.
From the docs:
--max-idle : The duration from the moment when the cluster enters the idle state to the moment when the cluster starts to delete. Provide the duration in IntegerUnit format, where the unit can be "s", "m", "h", or "d" (seconds, minutes, hours, days). Examples: "30m" or "1d" (30 minutes or 1 day from when the cluster becomes idle). Granularity: 1 second; minimum: 10 minutes; maximum: 14 days.
--expiration-time : The time to start deleting the cluster, in ISO 8601 datetime format. An easy way to generate the datetime in the correct format is through the Timestamp Generator. For example, "2017-08-22T13:31:48-08:00" specifies an expiration time of 13:31:48 in the UTC-8:00 time zone. Granularity: 1 second; minimum: 10 minutes from the current time; maximum: 14 days from the current time.
--max-age : The duration from the moment of submitting the cluster create request to the moment when the cluster starts to delete. Provide the duration in IntegerUnit format, where the unit can be "s", "m", "h", or "d". Examples: "30m" (30 minutes from now); "1d" (1 day from now). Granularity: 1 second; minimum: 10 minutes; maximum: 14 days.
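Putting those flags together, something like the following (cluster name and region are placeholders) creates a cluster that deletes itself after 30 minutes of inactivity, and in any case no later than one day after creation:

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --max-idle=30m \
    --max-age=1d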
You can use one of two main approaches:
- Downscale the cluster to the minimum number of workers (two) [1]
- Delete the cluster and recreate it later [2]
Both approaches work best when you use the Google Cloud Storage Connector [3] instead of HDFS to store your data, so that no state is lost when the cluster shrinks or goes away.
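For instance (the bucket, file, and cluster names below are hypothetical), you can keep both job code and data in GCS and submit against whatever cluster currently exists:

gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/my_job.py \
    --cluster=my-dataproc-cluster-name \
    --region=us-central1 \
    -- gs://my-bucket/input/ gs://my-bucket/output/

Because nothing job-related lives on the cluster's HDFS, the cluster can be resized or deleted at any time without losing data.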
To downscale your cluster, you would run this command during off-peak hours:
gcloud dataproc clusters update <cluster-name> --num-workers <new-number-of-workers>
To delete the cluster during off-peak hours, use:
gcloud dataproc clusters delete my-dataproc-cluster-name
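If you go the delete/re-create route, a small wrapper around job submission can re-create the cluster on demand. This is only a sketch; the names, region, and worker counts are assumptions, and it leans on --max-idle (from the other answers) to clean up again afterwards:

#!/usr/bin/env bash
set -euo pipefail

CLUSTER="my-dataproc-cluster-name"   # hypothetical name
REGION="us-central1"                 # hypothetical region

# Re-create the cluster if it was deleted during a period of inactivity.
if ! gcloud dataproc clusters describe "$CLUSTER" --region="$REGION" >/dev/null 2>&1; then
  gcloud dataproc clusters create "$CLUSTER" \
      --region="$REGION" \
      --num-workers=2 \
      --max-idle=1h   # have it delete itself again after an hour idle
fi

# Submit the incoming job (code kept in GCS, per [3]).
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/my_job.py \
    --cluster="$CLUSTER" --region="$REGION"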
Potentially, you can lower your ongoing Dataproc costs by up to 70% with preemptible VMs [4], which are fully supported by Dataproc.
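For example (worker counts here are illustrative; newer gcloud releases refer to preemptible workers as "secondary workers"):

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-preemptible-workers=8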
[1] Scaling Dataproc Clusters
[2] Managing Dataproc Clusters
[3] Google Cloud Storage Connector for Spark/Hadoop
[4] Preemptible VMs