Can Python dependencies be loaded into a Google Cloud Dataflow pipeline?
I would like to use gensim's phrase modeler, which reads data line by line to automatically detect common phrases/bigrams (two words that frequently appear next to each other).
The first pass through the pipeline would feed each sentence to the phrase modeler to train it.
The second pass would then apply the trained modeler to each sentence to merge the phrases it detected (sketched in code below). Example:
- If `machine` and `learning` frequently appear next to each other in the corpus, they would be transformed to a single word `machine_learning` instead.
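For concreteness, here is a minimal local sketch of the two passes using gensim's `Phrases`/`Phraser` API; the toy sentences and the `min_count`/`threshold` values are illustrative only:

```python
from gensim.models.phrases import Phrases, Phraser

# Pass 1: stream tokenized sentences through Phrases to collect
# co-occurrence statistics (toy corpus and thresholds for illustration).
sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "models", "need", "data"],
]
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)  # freeze into a lighter model for application

# Pass 2: apply the trained model; frequently co-occurring pairs are merged.
print(bigram[["machine", "learning", "is", "fun"]])
# ['machine_learning', 'is', 'fun']
```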
Would this be possible to accomplish within Dataflow?
Can a build/requirements file be passed, forcing `pip install gensim` on the worker machines?
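For reference, this is the kind of setup I have in mind, assuming the `apache_beam` Python SDK's `--requirements_file` pipeline option (which stages the listed packages onto workers); the project and bucket names below are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# requirements.txt would contain a line such as: gensim
# --requirements_file tells the runner to pip-install those packages
# on each worker; project/bucket names here are hypothetical.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--temp_location=gs://my-bucket/tmp",
    "--requirements_file=requirements.txt",
])

with beam.Pipeline(options=options) as p:
    ...  # pipeline steps that import and use gensim inside DoFns
```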