4 votes

I found a similar SO question from August that describes more or less what my team has been experiencing with our Dataflow pipelines recently: How to recover from Cloud Dataflow job failed on com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone

Here is the exception (a few 410 exceptions were thrown over a span of ~1 hour, but I am pasting only the last one):

(9f012f4bc185d790): java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "Backend Error",
    "reason" : "backendError"
  } ],
  "message" : "Backend Error"
}
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
    at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
    at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:97)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:287)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:223)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: java.nio.channels.ClosedChannelException
        at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.throwIfNotOpen(AbstractGoogleAsyncWriteChannel.java:408)
        at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:286)
        at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
        at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.abort(WriteOperation.java:112)
        at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:86)
        ... 10 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "Backend Error",
    "reason" : "backendError"
  } ],
  "message" : "Backend Error"
}
    at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
    ... 4 more
 2016-12-18 (18:58:58) Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/D...
(d3b59c20088d726e): Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/DataflowPipelineRunner.BatchWrite/Finalize failed., (2f2e396b6ba3aaa2): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on: userprofile-lifetime-diff-12181745-c04d-harness-0xq5, userprofile-lifetime-diff-12181745-c04d-harness-adga, userprofile-lifetime-diff-12181745-c04d-harness-cz30, userprofile-lifetime-diff-12181745-c04d-harness-lt9m

Here is the job id: 2016-12-18_17_45_23-2873252511714037422

Based on the answer to the other SO question I mentioned, I am re-running the same job with the number of shards explicitly set to 4000 (this job runs daily and normally outputs ~4k files); a sketch of how we set this is below. Is there a reason why limiting the number of shards to something below 10k helps? Knowing this would help us decide whether we need to redesign our pipeline.
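For reference, here is roughly how we set the fixed shard count. This is a minimal sketch assuming the Dataflow Java SDK 1.x TextIO sink; the class name and GCS paths are placeholders, not our actual pipeline code:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class FixedShardWriteSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.Read.from("gs://our-bucket/userprofile-lifetime/input*"))  // placeholder input
     .apply(TextIO.Write
         .to("gs://our-bucket/userprofile-lifetime/output")                   // placeholder output prefix
         .withNumShards(4000));  // fixed shard count; omitting this lets the service pick the sharding

    p.run();
  }
}
```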

Also, with the number of shards specified, the job takes much longer than it does without, mainly because of the step that writes to GCS. In dollar terms, this job usually costs $75-80 (and we run it daily), whereas it cost $130-$140 (roughly a 74% increase) when I specified the number of shards; the other steps ran for more or less the same duration (that job id is 2016-12-18_19_30_32-7274262445792076535). So we would really like to avoid specifying the number of shards, if possible.

Any help and suggestions will be really appreciated!

Follow-up: the output of this job seems to disappear and then reappear in GCS when I run 'gsutil ls' on the output directory, even 10+ hours after the job completed. This may be a related issue, but I created a separate question for it here ("gsutil ls" shows a different list every time).


2 Answers

3 votes

Yes: specifying the number of shards enforces additional constraints on how Dataflow executes the job and may impact performance. For example, dynamic work rebalancing is incompatible with a fixed number of shards.

As far as I understand, the 410 Gone error is a transient GCS issue, and Dataflow's retry mechanism can generally work around it. However, if it occurs at too high a rate, it can fail the job: in batch mode, Dataflow fails the job if a single bundle fails 4 times.

0 votes

The 410 Gone errors that happened in this job seem to be harmless (they've been successfully retried).

The job was taking longer, AFAIK, not because of the sharding per se, but because workers were "losing contact with the service"; that in turn was caused by RPC quota issues, which I think Frances said have since been fixed.

Note that your job will be much cheaper if you don't disable autoscaling, since Dataflow is then allowed to shut down idle workers.
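In case it helps, here is a minimal sketch of what I mean by leaving autoscaling enabled, assuming the Dataflow Java SDK 1.x options classes; the maxNumWorkers value is purely illustrative:

```java
import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class AutoscalingOptionsSketch {
  public static void main(String[] args) {
    // Keep autoscaling on (THROUGHPUT_BASED) rather than setting it to NONE,
    // and only cap the pool size, so idle workers can be shut down while the
    // write stage drains.
    DataflowPipelineWorkerPoolOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation()
        .as(DataflowPipelineWorkerPoolOptions.class);
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(50);  // illustrative cap, not a recommendation
  }
}
```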

Please let me know if you're still experiencing performance issues with fixed sharding!