Objective - I have a Dataflow template (written in Python) that depends on pandas and nltk, and I want to trigger the Dataflow job from a Cloud Function. For this purpose, I have uploaded the code to a bucket and am ready to specify the template location in the Cloud Function.
Problem - How do I pass the requirements_file parameter (the one you would normally pass to install third-party libraries) when triggering a Dataflow job from a Cloud Function using the googleapiclient discovery module?
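To make the setup concrete, here is a minimal sketch of this kind of trigger, assuming the googleapiclient discovery client and the Dataflow v1b3 templates.launch endpoint; the project ID, bucket, template path, and job name are placeholders, not values from the question:

```python
# Minimal sketch (assumed setup): launching a staged Dataflow template
# from a Cloud Function via the googleapiclient discovery client.
from googleapiclient.discovery import build

def trigger_dataflow(event, context):
    service = build("dataflow", "v1b3")
    request = service.projects().templates().launch(
        projectId="my-project",                          # placeholder
        gcsPath="gs://my-bucket/templates/my-template",  # staged template
        body={
            "jobName": "pandas-nltk-job",  # placeholder
            "parameters": {},              # template runtime parameters
            # Note: the launch body has no obvious field corresponding to
            # --requirements_file, which is the crux of the question.
        },
    )
    return request.execute()
```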
Prerequisites - I know this can be done when launching a job from a local machine by specifying a local file path, but when I try to specify a GCS path such as --requirements_file gs://bucket/requirements.txt, it fails with:
The file gs://bucket/requirements.txt cannot be found. It was specified in the --requirements_file command line option.
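For comparison, a sketch of the two variants of the pipeline options, assuming a standard apache_beam PipelineOptions setup; the flag values are placeholders:

```python
# Sketch (assumed pipeline setup): a local requirements_file path is
# accepted, while the GCS path fails at submission with the error above.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                  # placeholder
    "--temp_location=gs://my-bucket/temp",   # placeholder
    "--requirements_file=requirements.txt",  # works: local file path
    # "--requirements_file=gs://bucket/requirements.txt",  # fails as quoted above
])
```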