2 votes

I want to use Dataflow to process a bunch of video clips I have stored in Google Cloud Storage in parallel. My processing algorithm has non-Python dependencies and is expected to change over development iterations.


My preference would be to use a Docker container holding the logic to process the clips, but it appears that custom containers were not supported (as of 2017):

use docker for google cloud data flow dependencies

Although they may be supported now, since support was being worked on:

Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job

According to this issue, a custom Docker image may be pulled, but I couldn't find any documentation on how to do this with Dataflow.

https://issues.apache.org/jira/browse/BEAM-6706?focusedCommentId=16773376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16773376

Another option might be to use setup.py to install any dependencies as described in this dated example:

https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python

However, when running the example I get an error that there is no module named osgeo.gdal.

For pure Python dependencies I have also tried passing the --requirements_file argument, but I still get an error: Pip install failed for package: -r

I could find documentation for adding dependencies to apache_beam, but not for Dataflow, and based on my tests of --requirements_file and --setup_file it appears the apache_beam instructions do not work.
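
For context, this is roughly the kind of invocation I have been testing (the project ID, bucket paths, and processing step are placeholders, not my actual pipeline):

```python
# Rough sketch of passing dependency flags to the Dataflow runner.
# Project IDs, bucket paths, and the processing step are placeholders.
import apache_beam as beam

argv = [
    '--runner=DataflowRunner',
    '--project=my-project',
    '--temp_location=gs://my-bucket/tmp',
    '--staging_location=gs://my-bucket/staging',
    # Pure-Python dependencies:
    '--requirements_file=requirements.txt',
    # Or, for dependencies that need custom install steps:
    # '--setup_file=./setup.py',
]

with beam.Pipeline(argv=argv) as p:
    (p
     | beam.Create(['gs://my-bucket/clips/clip1.mp4'])
     | beam.Map(lambda path: path))  # clip processing would go here
```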

You say you've tried using --setup_file and it hasn't worked. Is that right? We have an example in Beam that relies on it. Can you check it, and see if that helps? github.com/apache/beam/tree/master/sdks/python/apache_beam/… – Pablo

The landsat example using setup_file didn't work for me. I tried the juliaset example and it does work! Thanks! Problem solved. Are there any plans to support docker images in the future? – shortcipher3

Docker images will be supported for pipelines using the portability framework. I can't give a great forecast for the timeline, but the plan is to support custom containers, yes. – Pablo

2 Answers

3 votes

This was answered in the comments, rewriting here for clarity:

In Apache Beam you can modify the setup.py file, which will be run once per container on start-up. This file allows you to perform arbitrary commands before the SDK Harness starts to receive commands from the Runner Harness.

A complete example can be found in the Apache Beam repo.
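
A minimal sketch in the style of that example follows; the apt packages (ffmpeg) and the package name are illustrative, not part of the original example:

```python
# setup.py sketch: run system-level install commands once per worker
# at start-up, before the SDK harness begins processing bundles.
import subprocess
from distutils.command.build import build as _build

import setuptools

# Shell commands to run on each worker; ffmpeg is an illustrative example.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'ffmpeg'],
]


class build(_build):
    """Adds the custom-command step to the standard build."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    """Runs each command listed in CUSTOM_COMMANDS."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)


setuptools.setup(
    name='clip-processing',              # placeholder package name
    version='0.0.1',
    packages=setuptools.find_packages(),
    install_requires=['opencv-python'],  # example Python dependency
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    },
)
```

The pipeline is then launched with --setup_file=./setup.py so that Dataflow stages the package and runs these commands on each worker.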

2 votes

As of 2020, you can use Dataflow Flex Templates, which allow you to specify a custom Docker container in which to execute your pipeline.
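
A rough sketch of such a launcher image (the base image and environment variables follow the Flex Templates documentation and may change between versions; ffmpeg is just an illustrative system dependency):

```dockerfile
# Sketch of a Flex Template launcher image with non-Python dependencies
# baked in. Paths and file names are placeholders.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template
WORKDIR ${WORKDIR}

COPY . ${WORKDIR}/

# System-level dependencies can be installed directly into the image.
RUN apt-get update && apt-get install -y ffmpeg

# Python dependencies for the pipeline.
RUN pip install -U -r requirements.txt

# Tell the launcher which files define the pipeline.
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
```

The image is built and registered as a template with the gcloud dataflow flex-template commands, and each job then executes inside that container.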