1 vote

I have a dataflow pipeline written with Apache Beam in Python 3.7 where I process a file and then have to delete it. The file comes from a Google Cloud Storage bucket, and the problem is that when I use the DataflowRunner my job doesn't work, because the google-cloud-storage API is not installed in the Google Dataflow Python 3.7 environment. How could I delete this file inside my dataflow job without using this API? I've seen Apache Beam modules like https://beam.apache.org/releases/pydoc/2.22.0/apache_beam.io.filesystem.html, but I have no idea how to use them and haven't found a tutorial or example for this module.
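For illustration, something like this is what I'm after, using Beam's own filesystem abstraction instead of google-cloud-storage (a minimal untested sketch; the bucket and file name are placeholders):

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


def delete_file(path):
    """Delete a file through Beam's filesystem layer (no google-cloud-storage needed)."""
    # FileSystems resolves the right filesystem implementation from the
    # gs:// scheme, so this works on Dataflow workers out of the box.
    FileSystems.delete([path])


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "File to process" >> beam.Create(["gs://my-bucket/my-file.csv"])
        # ... processing steps would go here ...
        | "Delete file" >> beam.Map(delete_file)
    )
```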

You should be able to include the GCS Python library when you run the pipeline by following https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/. – Brent Worden
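Following that page's approach, a minimal sketch of passing a requirements file through the pipeline options (project and bucket names are placeholders; requirements.txt is assumed to list google-cloud-storage):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# --requirements_file tells Dataflow to pip-install the listed packages
# on every worker before the job starts.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                   # placeholder
    temp_location="gs://my-bucket/tmp",     # placeholder
    requirements_file="requirements.txt",   # contains: google-cloud-storage
)
```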

1 Answer

2 votes

I don't think you can delete the file from within the running Dataflow job itself; you have to delete it after the Dataflow job has completed. For this I normally recommend some kind of orchestration, such as Apache Airflow or Google Cloud Composer.

You can make a DAG in Airflow as follows:

[Screenshot: an Airflow DAG with a "Custom DAG Workflow" task followed by a "Custom Python Code" task]

Here,

"Custom DAG Workflow" will have the Dataflow job.
"Custom Python Code" will have the Python code to delete the file (a sketch follows below).

Refer to https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples