2 votes

I have a Python Cloud Dataflow job that works fine on smaller subsets, but fails for no obvious reason on the complete dataset.

The only error I get in the Dataflow interface is the standard error message:

A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service.

Analysing the Stackdriver logs only shows this error:

Exception in worker loop: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 736, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 590, in do_work
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 167, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 454, in report_completion_status
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 266, in report_status
    work_executor=self._work_executor)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 364, in report_status
    response = self._client.projects_jobs_workItems.ReportStatus(request)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/clients/dataflow/dataflow_v1b3_client.py", line 210, in ReportStatus
    config, request, global_params=global_params)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 723, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 729, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 599, in __ProcessHttpResponse
    http_response.request_url, method_config, request)
HttpError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects//jobs/2017-05-03_03_33_40-3860129055041750274/workItems:reportStatus?alt=json>: response: <{'status': '400', 'content-length': '360', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Wed, 03 May 2017 16:46:11 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{
  "error": {
    "code": 400,
    "message": "(2a7b20b33659c46e): Failed to publish the result of the work update. Causes: (2a7b20b33659c523): Failed to update work status. Causes: (8a8b13f5c3a944ba): Failed to update work status., (8a8b13f5c3a945d9): Work \"4047499437681669251\" not leased (or the lease was lost).",
    "status": "INVALID_ARGUMENT"
  }
}>

I assume this "Failed to update work status" error is related to the Cloud Runner? Since I didn't find any information on this error online, I was wondering if somebody else has encountered it and has a better explanation?

I am using Google Cloud Dataflow SDK for Python 0.5.5.

What are your pipeline source(s) and sink(s)? – Graham Polley
The 2 sources are Avro files on GCS and the sinks are TFRecord files on GCS. – Fematich
Do you have a job ID available to share? Any details about what your pipeline is doing that you could describe? – Ben Chambers
The job ID: 2017-05-07_13_10_15-6017060458892203962. The pipeline is a preprocessing job for ML Engine: it starts from 2 sets of Avro files, combines these, and then generates TFRecords from the combined data. – Fematich
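
For reference, a pipeline of the shape described in the comments above (two Avro sources on GCS, combined, written out as TFRecords) might look roughly like the sketch below. It uses the current Apache Beam Python API (ReadFromAvro, WriteToTFRecord) rather than the 0.5.5 SDK, and all paths, field names, and the serialization step are hypothetical:

    import json

    import apache_beam as beam
    from apache_beam.io.avroio import ReadFromAvro
    from apache_beam.io.tfrecordio import WriteToTFRecord


    def to_serialized_record(keyed_record):
        # Hypothetical stand-in: a real ML Engine preprocessing job would
        # build and serialize a tf.train.Example here. WriteToTFRecord only
        # needs each element to be bytes.
        key, grouped = keyed_record
        return json.dumps({'id': key, 'data': grouped}).encode('utf-8')


    with beam.Pipeline() as p:
        # Two Avro sources on GCS (bucket and field names are placeholders).
        examples = p | 'ReadExamples' >> ReadFromAvro('gs://my-bucket/examples/*.avro')
        labels = p | 'ReadLabels' >> ReadFromAvro('gs://my-bucket/labels/*.avro')

        # Combine the two sources by joining on a shared key.
        joined = (
            {'examples': examples | 'KeyExamples' >> beam.Map(lambda r: (r['id'], r)),
             'labels': labels | 'KeyLabels' >> beam.Map(lambda r: (r['id'], r))}
            | 'Join' >> beam.CoGroupByKey())

        # Serialize each joined record and write TFRecord files back to GCS.
        (joined
         | 'Serialize' >> beam.Map(to_serialized_record)
         | 'WriteTFRecords' >> WriteToTFRecord('gs://my-bucket/output/train'))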

1 Answer

2 votes

One major cause of lease expirations is memory pressure on the VM. Try running your job on machines with more memory; in particular, a highmem machine type should do the trick.
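
For example, a job can request high-memory workers through its pipeline options when it is launched. This is a minimal sketch: the flag spelling and import path vary across SDK versions, and the project and bucket values are placeholders:

    from apache_beam.utils.pipeline_options import PipelineOptions
    # Newer SDKs moved this to apache_beam.options.pipeline_options.

    options = PipelineOptions([
        '--runner=DataflowRunner',       # older SDKs spell this BlockingDataflowPipelineRunner
        '--project=my-project',          # placeholder project id
        '--temp_location=gs://my-bucket/tmp',
        '--machine_type=n1-highmem-8',   # highmem: ~6.5 GB RAM per vCPU vs. 3.75 GB on n1-standard
    ])

The same flags can also be passed on the command line when launching the job, since pipeline options are parsed from argv.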

For more info on machine types, please check out the GCE documentation.

The next Dataflow release (2.0.0) should be able to handle these cases better.