2
votes

Some of our Dataflow jobs fail intermittently while reading source data files.

The following error appears in the job logs (there is nothing in the worker logs):

Feb 11, 2016 at 08:30:54
(33b59f945cff28ab): Workflow failed. 
Causes: (fecf7537c059fece): S02:read-edn-file2/TextIO.Read+read-edn-file2    
/ParDo(ff19274a)+ParDo(ff19274a)5+ParDo(ff19274a)6+RemoveDuplicates
/CreateIndex+RemoveDuplicates/Combine.PerKey
/GroupByKey+RemoveDuplicates/Combine.PerKey/Combine.GroupedValues
/Partial+RemoveDuplicates/Combine.PerKey/GroupByKey/Reify+RemoveDuplicates
/Combine.PerKey/GroupByKey/Write failed

We also sometimes get this kind of error (logged in the worker logs):

2016-02-15T10:27:41.024Z: Basic:  S18: (43c8777b75bc373e): Executing operation group-by2/GroupByKey/Read+group-by2/GroupByKey/GroupByWindow+ParDo(ff19274a)19+ParDo(ff19274a)20+ParDo(ff19274a)21+write-edn-file3/ParDo(ff19274a)+write-bq-table-from-clj3/ParDo(ff19274a)+write-bq-table-from-clj3/BigQueryIO.Write+write-edn-file3/TextIO.Write
2016-02-15T10:28:03.994Z: Error:   (af73c53187b7243a): java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
 "code" : 503,
 "errors" : [ {
   "domain" : "global",
   "message" : "Backend Error",
   "reason" : "backendError"
 } ],
 "message" : "Backend Error"
}
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
    at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
    at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:100)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:254)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:191)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:144)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:180)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:161)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:148)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

The source data files are stored in Google Cloud Storage.

The data paths are correct, and the job generally succeeds after being relaunched. We did not experience this issue until the end of January.

Jobs are launched with these parameters: --tempLocation='gstoragelocation' --stagingLocation='another gstorage location' --runner=BlockingDataflowPipelineRunner --numWorkers='a few dozen' --zone=europe-west1-d

SDK version: 1.3.0
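Since a manual relaunch generally succeeds, our current workaround is simply to retry the whole launch. A minimal, self-contained sketch of that retry loop (the `launchJob` callable and `JobRelauncher` class are placeholders of mine, not SDK APIs; in practice the callable would wrap the actual `Pipeline.run()` call):

```java
import java.util.concurrent.Callable;

// Hypothetical helper: relaunch the whole job up to maxAttempts times,
// since a fresh launch usually succeeds after a transient backend failure.
public class JobRelauncher {
    public static <T> T runWithRetries(Callable<T> launchJob, int maxAttempts)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return launchJob.call();       // e.g. wraps pipeline.run()
            } catch (Exception e) {
                last = e;                      // assume transient; try again
                Thread.sleep(200L * attempt);  // simple linear backoff
            }
        }
        throw last;                            // exhausted attempts: give up
    }
}
```

This retries on any exception; a production version would inspect the failure and only retry errors that look transient.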

Thanks

1
Sorry for the trouble. We're currently investigating this and similar issues with the Google Cloud Storage team. Could you provide an example failed job ID? - jkff
The issue hit by the first job on 2/10 should have been fixed this week. Please let us know if you see it again. Does the second type of error cause the job to fail or is it transient enough that bundles succeed on retry? - Frances
Thanks for your answer. One of our jobs failed again this morning: 2016-02-21_23_00_17-5627071082821060268. The error causes the job to fail even though there are retries, but it generally succeeds if the job is relaunched manually (for both the first and second type of error). - Pierre
Another example of a job that just failed: 2016-02-22_02_13_22-5788209240587963563. The worker logs are almost empty for this one. - Pierre
Thanks, Pierre. We're continuing to investigate. - Frances

1 Answer

0
votes

As a clearly marked "backend error", this should be reported on either the Cloud Dataflow Public Issue Tracker or the more general Cloud Platform Public Issue Tracker; there is relatively little anyone on Stack Overflow can do to help you debug it.
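If you do add your own retry wrapper while waiting on the issue tracker, a common rule of thumb is to treat server-side 5xx responses (like the 503 "Backend Error" above) and the "410 Gone" seen on the resumable upload as transient, and 4xx client errors as fatal. A tiny illustrative predicate (the class and method names are mine, not a Dataflow SDK API):

```java
// Illustrative classification of HTTP status codes seen in the logs above:
// retry server-side failures, fail fast on client errors.
public class TransientErrors {
    public static boolean isRetryable(int httpStatus) {
        if (httpStatus == 410) {
            return true;  // resumable-upload session lost; restart the write
        }
        return httpStatus >= 500 && httpStatus < 600;  // backend errors
    }
}
```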