
I am running a dataflow that I last ran a few months ago, from the same client, with the same dataflow version (0.7.0dev0). Unfortunately, it now fails in mysterious ways that it did not before.

I am starting the job, and the first stage is:

(8733429d016bc2fb): Executing operation read from datastore/Split Query+read from datastore/GroupByKey/Reify+read from datastore/GroupByKey/Write

But it gives the following error after 1 hour:

(e88cb3c076926976): Workflow failed. Causes: (e88cb3c07692626f): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.

If it would help, the JobID is 2017-08-21_00_30_03-3588685705436948852. I would upgrade to a newer version of the library, but that involves a bunch more API changes and figuring out how to get all the pieces working again, so I'm holding off on that for now. I was hoping that "a simple use case that previously worked and currently fails" might be easier to debug than changing even more things.

I'm not sure how to debug or investigate further. It worked a few months ago with the same code, but doesn't work now (with a 4-5x larger dataset of 200-300K records; nothing crazy...)

Could you share a job ID or any more details of your pipeline? Would it be possible to upgrade to a newer version? - Ben Chambers
Okay, things seem to work after upgrading to 2.0.0! (Required some import fixups, reworking how I download/import apache-beam, etc.) I assume there's just some bitrot on the gcloud servers not supporting the 0.7.0-dev version... - Mike Lambert
I am experiencing this exact issue. The job used to take 4-6 minutes, but now it doesn't end; rather, it just doesn't start. It shows a partially running state on GroupByKey and a running state on UserQuery and SplitQuery. I was using the 2.1.0 Python SDK, and tried the 2.0.0 SDK, but the error still persists. How do I go about it? @BenChambers - Anuj
@BenChambers Also, the data I am working on hasn't changed in size. Since the job used to take 4-5 minutes, I stopped all the jobs which ran more than 10 minutes; I'll try to check if it shows the workflow-failed error - Anuj
Please start a new question -- since you were already on a newer SDK and likely have a different pipeline, it is likely a different issue. Job IDs will also be necessary to dig in much further. - Ben Chambers

1 Answer


This was fixed by upgrading to 2.0.0 (thanks Ben Chambers!). It seems that 0.7.0 no longer works well with Cloud Dataflow.
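For anyone hitting the same issue, the upgrade itself is roughly the following. This is a sketch assuming a pip-based install; the exact package name of the old pre-Beam SDK on your machine may differ.

```shell
# Remove the old pre-Beam Dataflow SDK, if present (package name is an assumption).
pip uninstall -y google-cloud-dataflow

# Install the Apache Beam 2.0.0 Python SDK with the Google Cloud extras.
pip install "apache-beam[gcp]==2.0.0"
```

After that, the import fixups mentioned above amount to importing `apache_beam` instead of the old SDK module and adjusting pipeline option names to match the 2.x API.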