Long-running Dataflow job fails with no errors in user code

Question

After running for 17 hours, my Dataflow job failed with the following message:

The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.

The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:

****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.

I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I can't estimate the probability that a single work item will fail 4 times over the course of a 24-hour job. But this same type of failure happens frequently for this long-running job, so it seems we need some way to either decrease the failure rate of work items or increase the allowed number of retries. Is either possible?

This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.

Update: The "STACK TRACES" section in the console is totally empty.

I'm having the same problem. There don't appear to be any errors in the workers' Stackdriver logs either. — manesioz
I'm sorry you're encountering this. You should file a support ticket for this sort of issue, as there's not enough info here to figure out what's going on. One thing that can happen is that the workers run out of memory (OOM) and stop sending updates to the service properly. Is your operation memory-intensive? — Pablo

manesioz · Accepted Answer · 2019-10-08T13:44:15

I was having the same problem, and it was solved by scaling up my workers' resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configs. See the Compute Engine machine types documentation for the full list of options.

Edit: Set it to highcpu or highmem depending on the requirements of your pipeline process.
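
As a rough sketch of the above, the flag can be assembled in Python before the pipeline is constructed. The helper name `dataflow_worker_flags` and the family list are my own illustration, not part of Apache Beam; only `--runner` and `--machine_type` are real pipeline flags, and the right family and size depend on whether your job is memory-bound or CPU-bound.

```python
# Hypothetical helper (not part of Apache Beam): builds the command-line
# flags that select a larger n1 worker machine type for a Dataflow job.

# Assumed set of n1 families; check the Compute Engine docs for current options.
N1_FAMILIES = ("standard", "highmem", "highcpu")


def dataflow_worker_flags(family, vcpus):
    """Return pipeline flags requesting an n1-<family>-<vcpus> worker."""
    if family not in N1_FAMILIES:
        raise ValueError("unknown n1 family: {!r}".format(family))
    machine_type = "n1-{}-{}".format(family, vcpus)
    return [
        "--runner=DataflowRunner",
        "--machine_type={}".format(machine_type),
    ]


if __name__ == "__main__":
    # Example: a memory-heavy pipeline might prefer a highmem machine.
    print(dataflow_worker_flags("highmem", 16))
```

The resulting list can be passed to `apache_beam.options.pipeline_options.PipelineOptions(flags)` together with your usual project, region, and temp location flags.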
