3
votes

I am trying to test my Dataflow pipeline on the DataflowRunner. My job always gets stuck at 1 hr 1 min with the message: The Dataflow appears to be stuck. Digging through the stack trace in Stackdriver, I come across the error: Failed to install packages: failed to install workflow: exit status 1. I saw other Stack Overflow posts saying this can happen when pip packages are not compatible. This is causing my worker startup to always fail.

This is my current setup.py. Can someone please help me understand what I am missing? The job ID is 2018-02-09_08_22_34-6196858167817670597.

setup.py

from setuptools import setup, find_packages


requires = [
            'numpy==1.14.0',
            'google-cloud-storage==1.7.0',
            'pandas==0.22.0',
            'sqlalchemy-vertica[pyodbc,turbodbc,vertica-python]==0.2.5',
            'sqlalchemy==1.2.2',
            'apache_beam[gcp]==2.2.0',
            'google-cloud-dataflow==2.2.0'
            ]

setup(
    name="dataflow_pipeline_dependencies",
    version="1.0.0",
    description="Beam pipeline for flattening ism data",
    packages=find_packages(),
    install_requires=requires
)

4 Answers

3
votes

Include the 'workflow' package in setup.py's required packages. The error is solved after including it.

from setuptools import setup, find_packages


requires = [
            'numpy==1.14.0',
            'google-cloud-storage==1.7.0',
            'pandas==0.22.0',
            'sqlalchemy-vertica[pyodbc,turbodbc,vertica-python]==0.2.5',
            'sqlalchemy==1.2.2',
            'apache_beam[gcp]==2.2.0',
            'google-cloud-dataflow==2.2.0',
            'workflow' # Include this line
            ]

setup(
    name="dataflow_pipeline_dependencies",
    version="1.0.0",
    description="Beam pipeline for flattening ism data",
    packages=find_packages(),
    install_requires=requires
)
2
votes

So I have figured out that workflow is not a PyPI package in this case, but actually the name of the .tar file that Dataflow creates containing your source code. Dataflow compresses your source code into a workflow.tar file in your staging environment and then tries to run pip install workflow.tar. If any issue comes up during this install, it fails to install the packages onto the workers.
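
If you want to reproduce that worker-side install locally before submitting the job, here is a minimal sketch (assuming it is run from the directory containing setup.py): build the same kind of source tarball that Dataflow stages as workflow.tar and pip install it, so any dependency conflict shows up on your machine instead of on the workers.

import glob
import subprocess
import sys

# Build a source distribution -- the same kind of tarball Dataflow stages as workflow.tar
subprocess.check_call([sys.executable, 'setup.py', 'sdist', '--dist-dir', 'dist'])

# Install the newest tarball the way the workers would; a conflicting pin in
# install_requires surfaces here as a pip error
tarball = sorted(glob.glob('dist/*.tar.gz'))[-1]
subprocess.check_call([sys.executable, '-m', 'pip', 'install', tarball])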

My issue was resolved by two things, sketched below:

  • I added six==1.10.0 to my requires, since I found from Workflow failed. Causes: (35af2d4d3e5569e4): The Dataflow appears to be stuck that there is an issue with the latest version of six.

  • I realized that sqlalchemy-vertica and sqlalchemy are out of sync and have conflicting dependency versions, so I removed my need for both and found a different Vertica client.
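
For reference, a hedged sketch of what the adjusted requires list could look like after those two changes (the replacement Vertica client is not named in this answer, so it is left out here):

requires = [
            'numpy==1.14.0',
            'google-cloud-storage==1.7.0',
            'pandas==0.22.0',
            'six==1.10.0',  # pin six explicitly
            'apache_beam[gcp]==2.2.0',
            'google-cloud-dataflow==2.2.0'
            # sqlalchemy and sqlalchemy-vertica removed; their dependency pins conflicted
            ]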

0
votes

I am no genius at dealing with lots of Python packages and managing all of their versions, incompatibilities, and the needs and wants of every one of them.

However, I can read error messages.

In your case the message says "failed to install workflow". After a quick Google search I found that "workflow" actually is a Python package.

So the error is simply complaining that you haven't installed workflow and that its attempt to do so failed.

To fix this problem:

  • Install workflow from its PyPI page (the latest version that Google showed me).

Or

  • Do the regular pip install workflow.

Either method should install the required package. Once it is installed, that particular error message should go away.

I hope this answer helped you!

0
votes

Your mileage may vary, but for me, none of the above worked (Python 3.7).

Instead, the solution seemed to be to keep my dependencies in a requirements.txt file and everything else in setup.py. It was important that I not load the requirements.txt lines into the install_requires property. However I did it, whether I included workflow or not, having install_requires seemed to lead me to this error.

Instead, my setup.py simply does not specify dependencies at all. I pass both the --requirements_file and --setup_file arguments when running the pipeline. That solved the issue for me, and there was a noticeable difference in how the pipeline built and launched: the dependencies were stored in the staging location this way, whereas before they were not.

For example:

setup.py

import setuptools

setuptools.setup(
    name='my_pipeline',
    version='0.0.0',
    packages=setuptools.find_packages()
)

requirements.txt

google-cloud-bigquery==1.24.0
google-cloud-storage==1.25.0
jinja2==2.11.1
[...etc...]

run_pipeline.sh

#!/usr/bin/env bash

[...code to set vars...]

if [ "${1}" = "dataflow" ]; then
  RUNNER="--runner DataflowRunner"
fi

python "${PIPELINE_FILE}" \
    --output "${OUTPUT}" \
    --project myproject \
    --region us-west1 \
    --temp_location "${TEMP}" \
    --staging_location "${STAGING}" \
    --no_use_public_ips \
    --requirements_file requirements.txt \
    --setup_file "./setup.py" \
    ${RUNNER}
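
For completeness, a minimal sketch of the kind of entry point ${PIPELINE_FILE} might point at (the my_pipeline.py name and the Create/Write steps are hypothetical placeholders, not the actual pipeline). The relevant detail is that any flag the script does not parse itself (--runner, --setup_file, --requirements_file, --project, and so on) is passed through to PipelineOptions, which is how the DataflowRunner picks them up.

my_pipeline.py (hypothetical)

import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--output', required=True)
    # Everything not parsed here is forwarded to Beam as pipeline options,
    # including --setup_file and --requirements_file
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (p
         | 'Create' >> beam.Create(['hello', 'world'])
         | 'Write' >> beam.io.WriteToText(known_args.output))


if __name__ == '__main__':
    run()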