
I am having serious issues running a Python Apache Beam pipeline on the GCP Dataflow runner, launched from CircleCI. I would really appreciate any hint on how to tackle this; I've tried everything I can think of, but nothing seems to work.

Basically, I'm running a Python Apache Beam pipeline on Dataflow that uses google-api-python-client-1.12.3. If I run the job from my machine (python3 main.py --runner dataflow --setup_file /path/to/my/file/setup.py), it works fine. If I run the same job from within CircleCI, the Dataflow job is created, but it fails with ImportError: No module named 'apiclient'.
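
For reference, a minimal setup.py of the kind I pass via --setup_file might look roughly like this; it's only a sketch, the project name and version are placeholders, and the google-api-python-client pin is the actual dependency in question:

```python
# setup.py -- minimal sketch; name and version are placeholders
import setuptools

setuptools.setup(
    name="my-beam-pipeline",
    version="0.0.1",
    packages=setuptools.find_packages(),
    install_requires=[
        # google-api-python-client installs both the googleapiclient module
        # and the legacy "apiclient" alias that the ImportError refers to.
        "google-api-python-client==1.12.3",
    ],
)
```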

By looking at this documentation, I think I should probably use a requirements.txt file explicitly. If I run that same pipeline from CircleCI, but add the --requirements_file argument pointing to a requirements file containing a single line (google-api-python-client==1.12.3), the Dataflow job fails because the workers fail too. In the logs, there is first an info message "ERROR: Could not find a version that satisfies the requirement wheel (from versions: none)", which results in a later error message "Error syncing pod somePodIdHere (\"dataflow-myjob-harness-rl84_default(somePodIdHere)\"), skipping: failed to \"StartContainer\" for \"python\" with CrashLoopBackOff: \"back-off 40s restarting failed container=python pod=dataflow-myjob-harness-rl84_default(somePodIdHere)\"". I found this thread, but the solution didn't seem to work in my case.
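
For clarity, the failing run is the same command as before with the extra flag added (the requirements.txt path here is just a placeholder):

```
python3 main.py \
  --runner dataflow \
  --setup_file /path/to/my/file/setup.py \
  --requirements_file /path/to/my/file/requirements.txt
```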

Any help would be really, really appreciated. Thanks a lot in advance!


1 Answer


This question looks very similar to yours. The solution seemed to be to explicitly include the transitive dependencies of your requirements in your requirements.txt:

apache beam 2.19.0 not running on cloud dataflow anymore due to Could not find a version that satisfies the requirement setuptools>=40.8
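
If it helps, one way to do that (a sketch, not taken verbatim from the linked answer) is to generate a fully pinned requirements.txt from a clean virtualenv, so the Dataflow workers receive every transitive dependency explicitly:

```
# In a clean virtualenv (path is arbitrary): install the top-level
# requirement, then freeze the full dependency closure into the file
# passed to --requirements_file.
python3 -m venv /tmp/reqs-env
. /tmp/reqs-env/bin/activate
pip install google-api-python-client==1.12.3
pip freeze > requirements.txt
```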