
I am using Google Dataflow Service to run some apache-beam scripts for ETL.

The jobs used to take 4-5 minutes to complete initially, but now they fail after an hour with the following error.

Workflow failed. Causes: (35af2d4d3e5569e4): The Dataflow appears to be stuck.

It appears that the job didn't actually start.

I was executing it using Python SDK 2.1.0. Since the answer to this question suggested switching the SDK, I tried executing it with Python SDK 2.0.0, but no luck.

Job Id is: 2017-09-28_04_28_31-11363700448712622518

Update:

After @BenChambers suggested checking the logs, it appears that the job didn't start because the workers failed to start up.

The logs showed the following output 4 times (as mentioned in the Dataflow docs, a bundle is retried 4 times before being declared failed):

Running setup.py install for dataflow-worker: finished with status 'done' 
Successfully installed dataflow-worker-2.1.0 
Executing: /usr/local/bin/pip install /var/opt/google/dataflow/workflow.tar.gz 
Processing /var/opt/google/dataflow/workflow.tar.gz 
 Complete output from command python setup.py egg_info: 
 Traceback (most recent call last): 
   File "<string>", line 1, in <module>
 IOError: [Errno 2] No such file or directory: '/tmp/pip-YAAeGg-build/setup.py' 

 ---------------------------------------- 
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-YAAeGg-build/ 
/usr/local/bin/pip failed with exit status 1 
Dataflow base path override: https://dataflow.googleapis.com/ 
Failed to report setup error to service: could not lease work item to report failure (no work items returned)

2 Answers

1 vote

A common cause of stuck pipelines is workers failing to start up. From the UI, you should be able to click "Logs" near the top, then the link that says "Stackdriver". This should take you to the Stackdriver Logging page, configured to view the worker logs for the given job. If you change that from worker to worker-startup, it should show the logs from the attempts to start the workers. If there were problems during startup, they should show up there.
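
If you would rather pull those worker-startup logs from a script than from the UI, below is a minimal sketch using the google-cloud-logging Python client. The resource type (dataflow_step), the worker-startup log name, and the PROJECT_ID value are assumptions/placeholders, not something taken from your job.

# Sketch: list worker-startup log entries for a Dataflow job with the
# Cloud Logging Python client (pip install google-cloud-logging).
# PROJECT_ID is a placeholder; the resource type and log name are
# assumptions about how Dataflow labels its worker-startup logs.
from google.cloud import logging

PROJECT_ID = "my-project"
JOB_ID = "2017-09-28_04_28_31-11363700448712622518"

client = logging.Client(project=PROJECT_ID)

log_filter = (
    'resource.type="dataflow_step" '
    'resource.labels.job_id="{}" '
    'logName="projects/{}/logs/dataflow.googleapis.com%2Fworker-startup"'
).format(JOB_ID, PROJECT_ID)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)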

0 votes

When a job is submitted to the Dataflow service, it installs the latest version of apache-beam on the workers. At present, the latest version of apache-beam is 2.1.0. Either apache-beam or the google-cloud Python packages must be using the Python package six in their internal implementation.

As this answer suggests, the latest version of six, i.e. 1.11.0, does not work with apache-beam 2.1.0.

I would suggest providing a setup file to the Dataflow service that pins six to version 1.10, not 1.11. You can do that by passing the install_requires parameter in the setup file:

install_requires=[
    'six==1.10.0',
]
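
For context, a complete setup.py along those lines could look like the sketch below; the name and version values are just placeholders, and the only part that matters here is the six pin.

# Minimal setup.py sketch that pins six for the Dataflow workers.
# name/version are placeholders; only install_requires matters here.
import setuptools

setuptools.setup(
    name='my-dataflow-job',   # placeholder package name
    version='0.0.1',          # placeholder version
    packages=setuptools.find_packages(),
    install_requires=[
        'six==1.10.0',        # 1.11.0 breaks apache-beam 2.1.0
    ],
)

You then point the job at this file with the setup_file pipeline option (e.g. --setup_file=./setup.py) when launching it on the Dataflow runner.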

You could read about setuptools at this link

You could read about how to provide a setup file to a Dataflow job at this link


Update

When you submit your job to Dataflow, the Dataflow service spins up Compute Engine instances as its workers and installs all the requirements needed for the pipeline to run. Which Python packages get installed is then entirely in the hands of the Dataflow service, and it installs whatever its default configuration specifies. This can lead to issues such as yours.

The solution is to provide a requirements file to the Dataflow job by passing a requirements_file argument in the pipeline options. This makes the Dataflow service install the Python packages listed in the requirements file on the workers, so issues caused by package versioning can be avoided.
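
A rough sketch of what that could look like is below; the project, bucket and file names are placeholders, and requirements.txt would contain the pinned packages (for example a single line: six==1.10.0).

# Sketch: have Dataflow install the packages from a requirements file
# on every worker. Project/bucket values are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                        # placeholder
    temp_location='gs://my-bucket/temp',         # placeholder
    staging_location='gs://my-bucket/staging',   # placeholder
    requirements_file='requirements.txt',        # installed on each worker at startup
)

pipeline = beam.Pipeline(options=options)
# ... build your transforms on `pipeline` as usual, then:
# pipeline.run().wait_until_finish()

The same option can also be passed on the command line as --requirements_file=requirements.txt.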

You can find out how to provide a requirements file to a Dataflow pipeline at this link