4
votes

Due to our business requirements, we are bound to use static, long-running, persistent Dataproc clusters. Is there any way to upgrade the Dataproc image to leverage the latest OS/OSS updates?

Please help me with some reference documentation to carry out this operation (preferably automation).

3
What drives the requirement for persistent clusters? Is it the volume of jobs submitted? Do you store data on the cluster in HDFS? - tix
In your other question you reference "1-namenode and n-number of datanodes" which suggests it is indeed data in HDFS. Any reason this is required, as opposed to storing final job output in GCS or BigQuery? - tix
Actually the huge volume of jobs submitted throughout the day is the reason driving the requirement of persistent cluster. We keep receiving files throughout the day and based on the business logic jobs are kicked-off on regular intervals. - Balajee Venkatesh
"1-namenode & n-datanode" highlights our on-prem cluster config. We haven't migrated to Dataproc yet. I just wanted to check for some references which would help us plan the equivalent Dataproc configuration. - Balajee Venkatesh
I added a few links to my answer that I hope will help. - tix

3 Answers

3
votes

In-place cluster upgrades are not supported by Dataproc today, which is why we advise customers to instead use ephemeral (per-job/workflow) or short-lived clusters (on the order of weeks, not years).

Unfortunately, Oozie does not play well with cloud-native or cloud-hybrid architectures. I would suggest building cluster-failover capabilities into your automation so you can delete/recreate clusters every so often. Perhaps, as part of cluster startup, the new cluster could emit a lock file that prevents the old cluster from spawning new jobs?
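As a rough sketch of what that rotation could look like with the gcloud CLI (the cluster names, region, image version, and GCS lock-file path below are all illustrative placeholders):

```shell
# Illustrative cluster rotation; names, region, and paths are placeholders.

# 1. Bring up the replacement cluster on the newer image.
gcloud dataproc clusters create jobs-cluster-new \
    --region=us-central1 \
    --image-version=2.1-debian12

# 2. Write a lock file so automation stops scheduling onto the old cluster.
gsutil cp /dev/null gs://my-bucket/locks/jobs-cluster-old.lock

# 3. Wait for running jobs to drain, then delete the old cluster.
gcloud dataproc clusters delete jobs-cluster-old \
    --region=us-central1 --quiet
```

The lock-file convention in step 2 is whatever your own automation checks before submitting jobs; Dataproc itself does not interpret it.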

Here are some additional references that may help.

On decoupling compute and storage:

https://www.qubole.com/blog/advantage-decoupling/

https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips

Options for long-lived clusters:

https://cloud.google.com/blog/products/data-analytics/10-tips-for-building-long-running-clusters-using-cloud-dataproc

See my second answer below for one way to deal with Oozie specifically.

0
votes

The following is a suggestion for what a cloud-hybrid migration of Oozie could look like.

As a first step, I would do a lift-and-shift and focus on separating compute and storage (e.g., replace HDFS with GCS). The sketch below would be step 2 of a migration.

The main blocker with Oozie is that it bundles event triggering and scheduling together. I would move triggering and scheduling out into an external Airflow deployment such as Cloud Composer. You can then parameterize Oozie workflows with file names, reducing each to "run this workflow on this file".

In response to a new file, Airflow runs DataprocWorkflowTemplateOperator (there is also an inline workflow operator if you would rather embed the workflow definition in Airflow). This workflow template contains a single job that triggers the Oozie workflow via pig sh oozie job ....
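Sketched with the gcloud CLI, such a single-step template might look like the following (the template name, region, Oozie server URL, and job.properties path are placeholders):

```shell
# Create an illustrative workflow template whose only step shells out to Oozie.
gcloud dataproc workflow-templates create oozie-trigger \
    --region=us-central1

# Pig's "sh" command runs the Oozie CLI on the cluster; the Oozie URL
# and properties-file path below are placeholders.
gcloud dataproc workflow-templates add-job pig \
    --workflow-template=oozie-trigger \
    --region=us-central1 \
    --step-id=run-oozie \
    --execute="sh oozie job -oozie http://localhost:11000/oozie -config /path/to/job.properties -run"
```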

Here is the part that is relevant to your question, and that gives you the advantages of a cloud migration: the workflow template uses a Cluster Selector, which chooses among one or more clusters based on cluster labels. This means you can use cluster labels to add and remove clusters from a pool. Once labels are removed, new workflow instantiations will not be submitted to the old cluster; once all of its jobs finish, you can delete it (thus upgrading the image). Another advantage is that you can maintain 2+ clusters in different GCP zones and fail over in case of service outages. Also, by decoupling scheduling from execution, you are no longer tied to a single long-lived cluster.

To summarize, by decoupling things into Airflow + Workflow Templates + Oozie you get:

  • Cluster OS and OSS upgrades are possible
  • You can opt to run 2+ clusters in different GCP zones and have failover in case of service outages
  • By decoupling scheduling from execution and storage from compute you're no longer tied to a single long lived cluster
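One possible shape for the label-based pooling summarized above, again using the gcloud CLI (the template name, cluster names, and the pool label are illustrative):

```shell
# Point the template at whichever cluster currently carries pool=active.
gcloud dataproc workflow-templates set-cluster-selector oozie-trigger \
    --region=us-central1 \
    --cluster-labels=pool=active

# Rotating clusters then becomes a label change, not a template change:
gcloud dataproc clusters update jobs-cluster-new \
    --region=us-central1 --update-labels=pool=active
gcloud dataproc clusters update jobs-cluster-old \
    --region=us-central1 --remove-labels=pool

# Delete jobs-cluster-old once its in-flight jobs finish.
```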

0
votes

To upgrade a cluster that continually executes new jobs, you can leverage the user-specified label load balancing feature.

It allows you to route jobs between clusters (old and new) based on user labels that can be applied dynamically, which lets you perform a cluster upgrade without downtime.
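For plain job submission (outside workflow templates), the same idea can be expressed with the --cluster-labels flag on gcloud dataproc jobs submit; the label, region, and example Spark job below are illustrative:

```shell
# Submit by label rather than by cluster name; the job is routed to a
# cluster that currently carries env=prod.
gcloud dataproc jobs submit spark \
    --region=us-central1 \
    --cluster-labels=env=prod \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# Upgrading without downtime: create a new cluster carrying env=prod,
# strip the label from the old cluster, then delete it once drained.
```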