3 votes

Can python dependencies be loaded into a Google Cloud Dataflow pipeline?

I would like to use gensim's phrase modeler, which reads data line by line to automatically detect common phrases/bigrams (two words that frequently appear next to each other).

So the first pass through the pipeline would feed each sentence to this phrase modeler.

The second pass through the pipeline would then take the same phrase modeler and apply it to each sentence to identify the phrases that should be joined into single tokens. Example:

  • If machine and learning frequently appear next to each other in the corpus, they would be transformed into the single token machine_learning instead (see the standalone sketch after this example).
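Outside of Beam, the two passes look like this in gensim itself. This is a minimal sketch; the toy corpus and the min_count/threshold values are illustrative only:

```python
from gensim.models.phrases import Phrases, Phraser

# Toy tokenized corpus; in the real pipeline each element would be
# one sentence read from the input source.
sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "needs", "data"],
    ["i", "study", "machine", "learning"],
]

# First pass: collect co-occurrence statistics over the whole corpus.
phrases = Phrases(sentences, min_count=1, threshold=1)

# Freeze the statistics into a lightweight transformer.
bigram = Phraser(phrases)

# Second pass: rewrite each sentence, joining detected bigrams.
for sentence in sentences:
    print(bigram[sentence])
    # e.g. ['machine_learning', 'is', 'fun']
```

The first pass is stateful over the entire corpus, which is why it has to complete before the second pass can start.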

Would this be possible to accomplish within Dataflow?

Can a build/requirements file be passed to force pip install gensim on the worker machines?


1 Answer

10 votes

You can check out this page for managing dependencies in your pipeline:

https://beam.apache.org/documentation/sdks/python-pipeline-dependencies

Example: for packages available on PyPI, you can use a requirements file by adding the following command-line option:

--requirements_file requirements.txt
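
The same option can also be set programmatically when constructing the pipeline options. The project id and bucket below are placeholders, and requirements.txt would list gensim:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder project id
    region="us-central1",
    temp_location="gs://my-bucket/tmp",    # placeholder bucket
    requirements_file="requirements.txt",  # pip-installs the listed packages on each worker
)
```

Dataflow workers then install everything listed in the file before running your pipeline code, so gensim can be imported inside your DoFns.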