
I am trying to run Pig jobs on a managed Dataproc cluster. I have several independent Pig jobs that run in parallel, and I have set continueOnFailure to true for each job. Now, if one of the jobs fails, all the others are stopped and the cluster is terminated. I don't want that; I want the failing job to be terminated and the other jobs to keep running as expected.

The YAML file from which I am instantiating the workflow is below:

jobs:
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh pqr.sh
  stepId: run-pig-pqr
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh abc.sh
  stepId: run-pig-abc

placement:
  managedCluster:
    clusterName: batch-job
    config:
      gceClusterConfig:
        zoneUri: asia-south1-a
      masterConfig:
        machineTypeUri: n1-standard-8
        diskConfig:
          bootDiskSizeGb: 50
      workerConfig:
        machineTypeUri: n2-highcpu-64
        numInstances: 2
        diskConfig:
          bootDiskSizeGb: 50
      softwareConfig:
        imageVersion: 1.4-ubuntu18

I am creating the cluster with the command:

gcloud dataproc workflow-templates instantiate-from-file --file $file-name.yaml

Am I giving any wrong config in my YAML?


2 Answers

1 vote

The point of clarification that may not be obvious in this situation is that the continueOnFailure parameter is specifically a PigJob parameter, not a Dataproc Workflow parameter; you'll see it on HiveJob as well, for example, but not on other Dataproc job types in the workflow. Thus, in this case continueOnFailure only applies to the behavior of the separate commands run inside a single PigJob, rather than being a setting for how multiple PigJobs behave when placed in a shared Dataproc Workflow.
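In other words, it is meant to govern whether later queries within one PigJob keep running after an earlier query fails, roughly like this (a sketch; the script names are placeholders):

jobs:
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh might-fail.sh    # if this query fails...
      - sh runs-anyway.sh   # ...this one can still be attempted
  stepId: run-pig-single-job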

At the moment, Dataproc Workflows unfortunately don't support a control for specifying whether to continue the rest of the workflow when a single job in the workflow fails; the current behavior assumes all the jobs are expected to succeed, or else the workflow is aborted.

As you point out, this clearly isn't a complete story for supporting all the use cases of Workflows. As mentioned in the comments, this would be a good feature request to file under https://cloud.google.com/support/docs/issue-trackers.

In the meantime, using Dataproc Cluster Scheduled Deletion is probably the closest to what you want if you don't want to manually track the multiple jobs to know when to tear down the cluster once the last one finishes. You would have to wait synchronously for cluster creation to complete before submitting jobs, but you can then use --async on the jobs so you don't have to poll on each job before submitting them all:

gcloud dataproc clusters create --max-idle=10m ${CLUSTER_NAME}
gcloud dataproc jobs submit pig --async --cluster ${CLUSTER_NAME} -e 'sh pqr.sh' 
gcloud dataproc jobs submit pig --async --cluster ${CLUSTER_NAME} -e 'sh abc.sh'

This would only really be efficient, though, if your jobs run much longer than the minimum idle TTL of 10 minutes.
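If you want to check on the async jobs afterwards, something along these lines should work (a sketch; depending on your gcloud version you may also need to pass --region, and ${JOB_ID} is whatever ID the submit command printed):

# Jobs still running on the cluster
gcloud dataproc jobs list --cluster ${CLUSTER_NAME} --state-filter active

# Block until a particular job finishes
gcloud dataproc jobs wait ${JOB_ID}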

0 votes

The continueOnFailure flag appears to work as expected in Pig: for some types of failures, the interpreter will ignore them and keep going. However, the Pig driver still exits with a non-zero error code, which causes the Dataproc job to fail, and then the Workflow cancels all jobs and deletes the cluster.

Since you're using shell commands, you could trap errors in your scripts and exit with code 0 instead:

# Exit with status 0 whenever a command in the script fails, so the Pig
# `sh` query (and therefore the Dataproc job) still reports success.
function finish {
    exit 0
}
trap finish ERR
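If you still want visibility into which command actually failed, a variation on the same trap could log the failing exit status before swallowing it (just a sketch):

function finish {
    code=$?
    echo "command failed with exit code ${code}; exiting 0 so the workflow continues" >&2
    exit 0
}
trap finish ERR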

I would also encourage you to file a feature request to add better toggles for handling errors as part of workflows here: https://issuetracker.google.com/issues/new?component=187133&template=0