
So I had a job running for downloading some files, and it usually takes about 10 minutes. This one ran for more than an hour before it finally failed with the following error message, and nothing else:

Workflow failed. Causes: (3f03d0279dd2eb98): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.

So here I am :-) The jobId: 2017-08-29_13_30_03-3908175820634599728

Just out of curiosity, will we be billed for the hour of stuckness? And what was the problem?

I'm working with Dataflow-Version 1.9.0

Thanks Google Dataflow Team

Something is definitely very odd with that job. I've marked it for internal investigation and we'll get back to you when we figure out what went wrong. Sorry about that! Is this a consistent failure? Or is the job running fine now? – Lara Schmidt

Because the download pipeline is created inside a DoFn, it was retried automatically and finished after ~6.5 min. JobId: 2017-08-29_15_29_11-1856842692501462974 – user2122552

1 Answer


It seems as though the job's workers were spending nearly all of their time in Java garbage collection (close to 100%: full GCs taking about 7 seconds were occurring roughly every 7 seconds).
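For reference, GC overhead like this can also be measured from inside the worker JVM itself, using the standard `GarbageCollectorMXBean` API. A minimal sketch (the class name `GcOverhead` is mine, not part of Dataflow):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverhead {
    // Fraction of JVM uptime spent in garbage collection, as a percentage.
    public static double gcOverheadPercent() {
        long totalGcMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report time
            if (t > 0) {
                totalGcMs += t;
            }
        }
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMs > 0 ? 100.0 * totalGcMs / uptimeMs : 0.0;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("GC overhead: %.1f%% of uptime%n", gcOverheadPercent());
    }
}
```

A healthy pipeline should report a small single-digit percentage; a value near 100%, as in this job, means the workers are doing almost nothing but collecting garbage.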

Your next best steps are to get a heap dump of the job by logging into one of the machines and using jmap. Use a heap dump analysis tool to inspect where all the memory is allocated to. It is best to compare the heap dump of a properly functioning job against the heap dump of a broken job. If you would like further help from Google, feel free to contact Google Cloud Support and share this SO question and the heap dumps. This would be especially useful if you suspect the issue is somewhere within Google Cloud Dataflow.
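If logging into a worker to run jmap is awkward, a heap dump in the same .hprof format can also be written programmatically from inside the JVM via the JDK's `HotSpotDiagnosticMXBean`. A minimal sketch (the class name `HeapDump` and the file name `worker-heap.hprof` are my own choices):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HeapDump {
    // Writes an .hprof file in the same binary format as `jmap -dump:live,format=b,file=...`.
    public static Path dump(String file, boolean liveObjectsOnly) throws IOException {
        Path path = Paths.get(file);
        Files.deleteIfExists(path); // dumpHeap refuses to overwrite an existing file
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diag.dumpHeap(path.toString(), liveObjectsOnly);
        return path;
    }

    public static void main(String[] args) throws IOException {
        Path p = dump("worker-heap.hprof", true);
        System.out.println("Wrote " + Files.size(p) + " bytes to " + p.toAbsolutePath());
    }
}
```

The resulting file can be opened in a heap-analysis tool (Eclipse MAT, VisualVM, etc.) exactly like a jmap dump, which makes it easy to diff a healthy run against a stuck one.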