
Our Cloud Dataflow job reads from BigQuery, does some preprocessing, and then writes back to BigQuery. Unfortunately, it fails after a few hours while reading from BigQuery, with the following error message:

raise exceptions.HttpError.FromResponse(response) apitools.base.py.exceptions.HttpNotFoundError: HttpError accessing : response: <{'x-guploader-uploadid': 'AEnB2UpgIuanY0AawrT7fRC_VW3aRfWSdrrTwT_TqQx1fPAAAUohVoL-8Z8Zw_aYUQcSMNqKIh5R2TulvgHHsoxLWo2gl6wUEA', 'content-type': 'text/html; charset=UTF-8', 'date': 'Tue, 19 Nov 2019 15:28:07 GMT', 'vary': 'Origin, X-Origin', 'expires': 'Tue, 19 Nov 2019 15:28:07 GMT', 'cache-control': 'private, max-age=0', 'content-length': '142', 'server': 'UploadServer', 'status': '404'}>, content No such object: --project--/beam/temp--job-name---191119-084402.1574153042.687677/11710707918635668555/000000000009.avro>

Before this error, the logs show many entries similar to the ones in this screenshot:

[screenshot of repeated log entries]

Does anyone have an idea what might cause the Dataflow job to fail? When the job runs on a small subset of the data, there is no problem at all.
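For reference, the pipeline is structured roughly like the simplified sketch below. The table names, query, and preprocessing function are placeholders, not the real job:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def preprocess(row):
    # Placeholder for the real preprocessing (the actual job also does
    # lemmatization with spaCy, among other things).
    row['text'] = row.get('text', '').strip()
    return row


def run(argv=None):
    # --project, --region, --temp_location, etc. are passed as CLI flags.
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (
            p
            | 'ReadFromBigQuery' >> beam.io.Read(beam.io.BigQuerySource(
                query='SELECT id, text FROM `my_dataset.my_input_table`',
                use_standard_sql=True))
            | 'Preprocess' >> beam.Map(preprocess)
            | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                'my_dataset.my_output_table',
                schema='id:INTEGER,text:STRING',
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
        )


if __name__ == '__main__':
    run()
```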


1 Answer


We took a closer look at the logs and found many records like the following:

Processing lull for over 350.68 seconds in state process-msecs in step s2. Traceback [...] doc = spacy(input_str)

We investigated this error message further and found that version 1.1.8 of spaCy (used in our pipeline for lemmatization) suffers from a memory leak, as described in this issue: GitHub. Accordingly, we upgraded spaCy to the most recent version and the problem disappeared.
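For illustration, this is a sketch of how a lemmatization step can be written so that the spaCy model is loaded once per worker rather than once per element; the model name and field names are placeholders, not our exact code:

```python
import apache_beam as beam


class LemmatizeDoFn(beam.DoFn):
    """Lemmatizes the 'text' field of each row with spaCy."""

    def setup(self):
        # Load the model once per worker to keep start-up cost
        # and memory usage down.
        import spacy
        self._nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

    def process(self, row):
        doc = self._nlp(row.get('text', ''))
        row['lemmas'] = ' '.join(token.lemma_ for token in doc)
        yield row


# Used in the pipeline as:
#   ... | 'Lemmatize' >> beam.ParDo(LemmatizeDoFn()) | ...
```

The upgraded spaCy version can then be pinned in the requirements.txt passed to Dataflow via --requirements_file, so that all workers install the fixed release.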