This can be done with Dataproc Workflow Templates.
A workflow template with a managed cluster creates the cluster when the workflow starts and deletes it again once all of the jobs have finished.
These are the steps you can follow to create and run the workflow:
- Create your workflow template
export REGION=us-central1
gcloud dataproc workflow-templates create workflow-id \
--region $REGION
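As an optional sanity check, you can confirm the template exists by listing the templates in the region:
gcloud dataproc workflow-templates list \
--region $REGION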
- Set the managed Dataproc cluster (machine types and number of workers) that the workflow will create to run the jobs
gcloud dataproc workflow-templates set-managed-cluster workflow-id \
--region $REGION \
--master-machine-type machine-type \
--worker-machine-type machine-type \
--num-workers number \
--cluster-name cluster-name
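As a concrete example, the placeholders could be filled in as follows; the n1-standard-4 machines, 2 workers and the cluster name my-ephemeral-cluster are only illustrative values that you should size for your own workload:
gcloud dataproc workflow-templates set-managed-cluster workflow-id \
--region $REGION \
--master-machine-type n1-standard-4 \
--worker-machine-type n1-standard-4 \
--num-workers 2 \
--cluster-name my-ephemeral-cluster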
- Add the jobs as steps to your workflow
gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file1.py \
--region $REGION \
--step-id job1 \
--workflow-template workflow-id
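If file1.py takes command-line arguments, they can be appended after a -- separator when adding the job. The command above would then look like this (the two GCS paths are hypothetical):
gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file1.py \
--region $REGION \
--step-id job1 \
--workflow-template workflow-id \
-- gs://bucket-name/input gs://bucket-name/output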
The second job needs the --start-after flag so that it only runs after the first job has finished.
gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file2.py \
--region $REGION \
--step-id job2 \
--start-after job1 \
--workflow-template workflow-id
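Before running it, you can review the assembled template (the managed cluster configuration, the two steps and their ordering) with:
gcloud dataproc workflow-templates describe workflow-id \
--region $REGION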
- Run the workflow
gcloud dataproc workflow-templates instantiate workflow-id \
--region $REGION
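The instantiate command waits until the workflow has finished. If you would rather return immediately and check on the operation later, you can add the --async flag:
gcloud dataproc workflow-templates instantiate workflow-id \
--region $REGION \
--async
Either way, the managed cluster is deleted automatically as soon as the last job completes, so it only exists while the workflow is running.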