I have a Dataflow pipeline built with Apache Beam in Python 3.7 where I process a file and then have to delete it. The file comes from a Google Cloud Storage bucket, and the problem is that when I use the DataflowRunner my job doesn't work, because the google-cloud-storage API is not installed in Google Dataflow's Python 3.7 environment. How could I delete this file inside my pipeline without using that API? I've seen Apache Beam modules like https://beam.apache.org/releases/pydoc/2.22.0/apache_beam.io.filesystem.html, but I have no idea how to use them and haven't found a tutorial or example for this module.
1 Answer
I don't think you can delete the file while the Dataflow job is running; you have to delete it after the job has completed. For that I normally recommend some kind of orchestration, such as Apache Airflow or Google Cloud Composer.
You can build the DAG in Airflow as two chained tasks, sketched below:
"Custom DAG Workflow" holds the Dataflow job.
"Custom Python Code" holds the Python code that deletes the file.