I want to use Dataflow to process, in parallel, a bunch of video clips I have stored in Google Cloud Storage. My processing algorithm has non-Python dependencies and is expected to change over development iterations.
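To make the goal concrete, here is a rough sketch of the pipeline I have in mind; `process_clip` and the bucket paths are placeholders for my actual logic, not working code:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def process_clip(gcs_path):
    # Placeholder: this is where the clip would be fetched and handed to
    # the non-Python tooling (ffmpeg/opencv/etc.) on the worker.
    return gcs_path

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ListClips' >> beam.Create(['gs://my-bucket/clips/a.mp4',
                                   'gs://my-bucket/clips/b.mp4'])
     | 'ProcessClip' >> beam.Map(process_clip)
     | 'WriteResults' >> beam.io.WriteToText('gs://my-bucket/results/out'))
```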
My preference would be to use a Docker container with the logic to process the clips, but it appears that custom containers were not supported (as of 2017):
use docker for google cloud data flow dependencies
Although they may be supported now, since support was being worked on at the time:
Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job
According to that issue, a custom Docker image may be pulled, but I couldn't find any documentation on how to do this with Dataflow.
Another option might be to use setup.py to install any dependencies, as described in this dated example. However, when running the example I get an error that there is no module named osgeo.gdal.
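For reference, my understanding of the pattern in that example is a setup.py that runs custom commands on each worker at install time. This is a sketch of that pattern; the `libgdal-dev` package and the project name are purely illustrative:

```python
# setup.py: run custom commands (e.g. apt-get) on each Dataflow worker
# during package installation, then install the package itself.
import subprocess
from distutils.command.build import build as _build
import setuptools

CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'libgdal-dev'],  # illustration only
]

class CustomCommands(setuptools.Command):
    user_options = []
    def initialize_options(self): pass
    def finalize_options(self): pass
    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

class build(_build):
    # Chain the custom commands into the normal build step.
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

setuptools.setup(
    name='clip-processing',
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)
```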
For pure Python dependencies I have also tried passing the --requirements_file argument, but I still get an error: Pip install failed for package: -r
I could find documentation for adding dependencies to apache_beam, but not to Dataflow, and it appears the apache_beam instructions do not work, based on my tests of --requirements_file and --setup_file.
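For completeness, this is roughly how I'm passing those flags when launching the job; the project and bucket names are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                   # placeholder
    '--temp_location=gs://my-bucket/tmp',     # placeholder
    '--requirements_file=requirements.txt',   # pure-Python deps (this is what fails for me)
    '--setup_file=./setup.py',                # for the setup.py approach above
])
```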
You've passed --setup_file and it hasn't worked. Is that right? We have an example in Beam that relies on it. Can you check it and see if that helps? github.com/apache/beam/tree/master/sdks/python/apache_beam/… – Pablo