3 votes

I would like to launch multiple Spark jobs sequentially in GCP, like

gcloud dataproc jobs submit spark file1.py
gcloud dataproc jobs submit spark file2.py
...

so that each job starts only when the previous one has completed.

Is there any way to do it?

1 Answer

4 votes

This can be done using Dataproc Workflow Templates.

The workflow creates the cluster, runs the jobs, and deletes the cluster as part of its execution.

These are the steps you can follow to create the workflow:

  1. Create your workflow template
export REGION=us-central1

gcloud dataproc workflow-templates create workflow-id \
  --region $REGION
  2. Set the managed cluster that will be used for the jobs
gcloud dataproc workflow-templates set-managed-cluster workflow-id \
    --region $REGION \
    --master-machine-type machine-type \
    --worker-machine-type machine-type \
    --num-workers number \
    --cluster-name cluster-name
  3. Add the jobs as steps to your workflow
gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file1.py \
    --region $REGION \
    --step-id job1 \
    --workflow-template workflow-id

The second job needs the --start-after parameter so that it only runs after the first job has completed.

gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file2.py \
    --region $REGION \
    --step-id job2 \
    --start-after job1 \
    --workflow-template workflow-id
  4. Run the workflow
gcloud dataproc workflow-templates instantiate workflow-id \
    --region $REGION
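
If you prefer to keep the whole definition in one file, the same workflow can also be described in YAML and instantiated with instantiate-from-file. The sketch below is a minimal, untested example: the file name workflow.yaml, the bucket, the machine type, and the worker count are assumed values, and the field names follow the WorkflowTemplate API resource, so double-check them against the current documentation.

# workflow.yaml (assumed file name)
jobs:
- stepId: job1
  pysparkJob:
    mainPythonFileUri: gs://bucket-name/file1.py
- stepId: job2
  # prerequisiteStepIds is the YAML equivalent of --start-after
  prerequisiteStepIds:
  - job1
  pysparkJob:
    mainPythonFileUri: gs://bucket-name/file2.py
placement:
  managedCluster:
    clusterName: cluster-name
    config:
      masterConfig:
        machineTypeUri: n1-standard-4
      workerConfig:
        machineTypeUri: n1-standard-4
        numInstances: 2

# Create the cluster, run job1 then job2, and tear the cluster down
gcloud dataproc workflow-templates instantiate-from-file \
    --file workflow.yaml \
    --region $REGION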