3 votes

I would like to launch multiple Spark jobs sequentially in GCP, like

gcloud dataproc jobs submit spark file1.py
gcloud dataproc jobs submit spark file2.py
...

so that each job starts only when the previous one has completed.

Is there any way to do it?

1 Answer

4 votes

This can be done using Dataproc Workflow Templates.

The workflow creates the cluster, runs the jobs, and deletes the cluster as part of its execution.

These are the steps you can follow to create the workflow:

  1. Create your workflow template
export REGION=us-central1

gcloud dataproc workflow-templates create workflow-id \
  --region $REGION
  2. Set the managed cluster that will be used for the jobs
gcloud dataproc workflow-templates set-managed-cluster workflow-id \
    --region $REGION \
    --master-machine-type machine-type \
    --worker-machine-type machine-type \
    --num-workers number \
    --cluster-name cluster-name
  3. Add the jobs as steps to your workflow
gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file1.py \
    --region $REGION \
    --step-id job1 \
    --workflow-template workflow-id

The second job needs the --start-after parameter so that it only runs after the first job has completed.

gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file2.py \
    --region $REGION \
    --step-id job2 \
    --start-after job1 \
    --workflow-template workflow-id
  4. Run the workflow
gcloud dataproc workflow-templates instantiate workflow-id \
    --region $REGION
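
If you prefer to keep the whole definition in one file, the same workflow can also be described in YAML and instantiated with instantiate-from-file. The sketch below is a minimal, untested example: the file name workflow.yaml, the bucket, the machine type, and the worker count are assumed values, and the field names follow the WorkflowTemplate API resource, so double-check them against the current documentation.

# workflow.yaml (assumed file name)
jobs:
- stepId: job1
  pysparkJob:
    mainPythonFileUri: gs://bucket-name/file1.py
- stepId: job2
  # prerequisiteStepIds is the YAML equivalent of --start-after
  prerequisiteStepIds:
  - job1
  pysparkJob:
    mainPythonFileUri: gs://bucket-name/file2.py
placement:
  managedCluster:
    clusterName: cluster-name
    config:
      masterConfig:
        machineTypeUri: n1-standard-4
      workerConfig:
        machineTypeUri: n1-standard-4
        numInstances: 2

# Create the cluster, run job1 then job2, and tear the cluster down
gcloud dataproc workflow-templates instantiate-from-file \
    --file workflow.yaml \
    --region $REGION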