2 votes

I want to use Dataflow to process, in parallel, a bunch of video clips I have stored in Google Cloud Storage. My processing algorithm has non-Python dependencies and is expected to change over development iterations.


My preference would be to use a Docker container with the logic to process the clips, but it appears that custom containers are not supported (as of 2017):

use docker for google cloud data flow dependencies

Although they may be supported now, since support was being worked on:

Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job

According to this issue a custom Docker image may be pulled, but I couldn't find any documentation on how to do it with Dataflow.

https://issues.apache.org/jira/browse/BEAM-6706?focusedCommentId=16773376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16773376

Another option might be to use setup.py to install any dependencies, as described in this dated example:

https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python

However, when running the example I get an error that there is no module named osgeo.gdal.

For pure Python dependencies I have also tried passing the --requirements_file argument; however, I still get an error: Pip install failed for package: -r

I could find documentation for adding dependencies to apache_beam, but not to Dataflow, and it appears the apache_beam instructions do not work, based on my tests of --requirements_file and --setup_file.
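For reference, here is roughly how I am constructing the pipeline options; the project, bucket, and clip paths are placeholders, and the actual processing logic is omitted:

```python
# Rough sketch of my pipeline setup; project, bucket, and clip paths are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                  # placeholder
    region='us-central1',
    temp_location='gs://my-bucket/temp',   # placeholder
    setup_file='./setup.py',               # or: requirements_file='requirements.txt'
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['gs://my-bucket/clips/clip1.mp4'])  # placeholder clip list
     | beam.Map(lambda path: path))                     # real processing omitted
```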

You say you've tried using --setup_file and it hasn't worked. Is that right? We have an example in Beam that relies on it. Can you check it, and see if that helps? github.com/apache/beam/tree/master/sdks/python/apache_beam/… - Pablo
The landsat example using setup_file didn't work for me. I tried the juliaset example and it does work! Thanks! Problem solved. Are there any plans to support docker images in the future? - shortcipher3
Docker images will be supported for pipelines using the portability framework. I can't give a great forecast for the timeline, but the plan is to support custom containers, yes. - Pablo

2 Answers

3 votes

This was answered in the comments; rewriting here for clarity:

In Apache Beam you can modify the setup.py file, which will be run once per container on start-up. This file allows you to run arbitrary commands before the SDK Harness starts to receive commands from the Runner Harness.

A complete example can be found in the Apache Beam repo.
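A minimal sketch of what that setup.py can look like, modeled on the juliaset example; the apt packages and Python dependencies below are placeholders for whatever your clip processing actually needs (e.g. ffmpeg and OpenCV):

```python
# setup.py sketch modeled on the Apache Beam juliaset example.
# The commands and packages below are placeholders.
import subprocess
from distutils.command.build import build as _build

import setuptools


class build(_build):
    """Extend the build command so CustomCommands runs when workers install the package."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


# System-level dependencies to install on each worker container at start-up.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'ffmpeg'],
]


class CustomCommands(setuptools.Command):
    """Run the commands listed in CUSTOM_COMMANDS."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run_custom_command(self, command_list):
        print('Running command: %s' % command_list)
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        print('Command output: %s' % stdout_data)
        if p.returncode != 0:
            raise RuntimeError(
                'Command %s failed: exit code %s' % (command_list, p.returncode))

    def run(self):
        for command in CUSTOM_COMMANDS:
            self.run_custom_command(command)


# Pure-Python dependencies (placeholder list).
REQUIRED_PACKAGES = ['opencv-python']

setuptools.setup(
    name='clip-processing',
    version='0.0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    },
)
```

Pass this file to the pipeline with --setup_file=./setup.py (or the setup_file option in PipelineOptions).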

2 votes

As of 2020, you can use Dataflow Flex Templates, which allow you to specify a custom Docker container in which to execute your pipeline.
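A rough sketch of the kind of Dockerfile a Python Flex Template uses; the file names and the ffmpeg install here are placeholders, and the full build/run workflow is described in the Flex Templates documentation:

```dockerfile
# Launcher image sketch for a Python Flex Template; paths and packages are placeholders.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

# Tell the template launcher where the pipeline entry point and setup file live.
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/process_clips.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="/template/setup.py"

COPY . /template

# Non-Python dependencies can be baked into the image here.
RUN apt-get update \
    && apt-get install -y ffmpeg \
    && pip install --no-cache-dir -U 'apache-beam[gcp]'
```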