I found a similar SO question from August that describes more or less what my team has been experiencing with our Dataflow pipelines recently: How to recover from Cloud Dataflow job failed on com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
Here is the exception (a few 410 exceptions were thrown over a span of ~1 hour, but I am pasting only the last one):
(9f012f4bc185d790): java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "Backend Error",
    "reason" : "backendError"
  } ],
  "message" : "Backend Error"
}
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:97)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:287)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:223)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Suppressed: java.nio.channels.ClosedChannelException
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.throwIfNotOpen(AbstractGoogleAsyncWriteChannel.java:408)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:286)
at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.abort(WriteOperation.java:112)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:86)
... 10 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "Backend Error",
    "reason" : "backendError"
  } ],
  "message" : "Backend Error"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
... 4 more
2016-12-18 (18:58:58) Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/D...
(d3b59c20088d726e): Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/DataflowPipelineRunner.BatchWrite/Finalize failed., (2f2e396b6ba3aaa2): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on: userprofile-lifetime-diff-12181745-c04d-harness-0xq5, userprofile-lifetime-diff-12181745-c04d-harness-adga, userprofile-lifetime-diff-12181745-c04d-harness-cz30, userprofile-lifetime-diff-12181745-c04d-harness-lt9m
Here is the job id: 2016-12-18_17_45_23-2873252511714037422
I am re-running the same job with the number of shards explicitly set to 4000 (this job runs daily and normally outputs ~4k files), based on the answer to the other SO question mentioned above. Is there a reason why limiting the number of shards to something below 10k helps? Knowing this would be useful in case we need to redesign our pipeline.
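For reference, here is a minimal sketch of how we pin the shard count, assuming the 1.x Java SDK's TextIO sink; the bucket paths and class name are placeholders, and our real pipeline has many more steps:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ShardedWriteSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Placeholder input path -- the real pipeline builds this collection from several sources.
    PCollection<String> userProfiles = p.apply(TextIO.Read.from("gs://my-bucket/input/*"));

    userProfiles.apply(
        TextIO.Write
            .to("gs://my-bucket/output/userprofile-lifetime")  // placeholder output prefix
            .withSuffix(".txt")
            // Pin the shard count below 10k, per the linked answer;
            // without this the service chooses the number of output files itself.
            .withNumShards(4000));

    p.run();
  }
}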
Also, when the number of shards is specified, the job takes much longer than it does otherwise, mainly because of the step that writes to GCS. In terms of dollars, this job usually costs $75-80 (and we run it daily), whereas it cost $130-$140 (roughly a 74% increase) when I specified the number of shards; the other steps seem to have run for about the same duration. The job id for that run is 2016-12-18_19_30_32-7274262445792076535. So we would really like to avoid having to specify the number of shards, if possible.
Any help or suggestions would be really appreciated!
-- Follow-up: The output of this job seems to disappear and then reappear in GCS when I run 'gsutil ls' on the output directory, even 10+ hours after the job completed. This may be a related issue, but I created a separate question for it: "gsutil ls" shows a different list every time.