3 votes

In Google Cloud Platform's Dataflow, when streaming an unbounded PCollection (say, from a Pub/Sub topic using PubsubIO), is there an efficient way to start and stop the Beam pipeline in Dataflow (for example, starting it at the start of the day and stopping it at the end of the day)? Is the only way to schedule this a cron App Engine service that starts the pipeline job and then stops it? Just looking to see if there are any other options out there.

Also, if I choose windowing for the unbounded PCollection (say, from Pub/Sub), is there a way to have the files written to a configurable directory, say an hourly directory for every window? I see it creates one file per window.


2 Answers

3 votes

I agree with Pablo that Airflow (and Cloud Composer, its managed offering on GCP) is a good choice for the first part of your question.
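For example, on Cloud Composer you could pair two scheduled DAGs: one that launches the streaming job in the morning and one that drains it in the evening. This is a minimal sketch, assuming the Google provider's DataflowTemplatedJobStartOperator and DataflowStopJobOperator (check the operator names and arguments against your provider version); the project, bucket, topic, and job names are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStopJobOperator,
    DataflowTemplatedJobStartOperator,
)

# Launch the streaming Pub/Sub-to-GCS template at the start of each day.
with DAG(
    dag_id="start_pubsub_to_gcs",
    schedule_interval="0 6 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    DataflowTemplatedJobStartOperator(
        task_id="start_job",
        template="gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text",
        job_name="pubsub-to-gcs",  # placeholder job name
        project_id="my-project",
        location="us-central1",
        parameters={
            "inputTopic": "projects/my-project/topics/my-topic",
            "outputDirectory": "gs://my-bucket/YYYY/MM/DD/HH/",
            "outputFilenamePrefix": "output-",
        },
    )

# Drain the job at the end of each day (draining lets in-flight windows finish).
with DAG(
    dag_id="stop_pubsub_to_gcs",
    schedule_interval="0 18 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    DataflowStopJobOperator(
        task_id="stop_job",
        job_name_prefix="pubsub-to-gcs",
        project_id="my-project",
        location="us-central1",
    )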

Regarding the second part of your question, you can look at the Google-provided Dataflow template for streaming from Cloud Pub/Sub to Cloud Storage text files. You can easily create hourly directories by setting outputDirectory to gs://<bucket>/YYYY/MM/DD/HH/, and it will automatically replace YYYY, MM, DD and HH with the values of the interval window.
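If you are not using Airflow, the same template can also be launched programmatically through the Dataflow REST API. A minimal sketch with the google-api-python-client library (the project, topic and bucket names are placeholders, and the template's parameter names should be double-checked against its documentation):

from googleapiclient.discovery import build

# Build a client for the Dataflow v1b3 API (uses application default credentials).
dataflow = build("dataflow", "v1b3")

request = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    gcsPath="gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text",
    body={
        "jobName": "pubsub-to-gcs-hourly",
        "parameters": {
            "inputTopic": "projects/my-project/topics/my-topic",
            # Date patterns in the path are expanded per interval window.
            "outputDirectory": "gs://my-bucket/YYYY/MM/DD/HH/",
            "outputFilenamePrefix": "output-",
        },
    },
)
response = request.execute()
print(response["job"]["id"])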

If you need to adapt this template to your specific needs, you can check its source code.
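If you do end up writing your own pipeline, the overall shape in the Beam Python SDK is roughly the following (a minimal sketch; the template itself is written in Java, and the topic and bucket names here are placeholders). The write step emits at least one file per window and shard, which matches what you observed; the template gets the YYYY/MM/DD/HH layout by deriving the output path from each window's start time:

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Read raw bytes from the (placeholder) Pub/Sub topic.
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        # Group elements into fixed one-hour windows.
        | "HourlyWindows" >> beam.WindowInto(window.FixedWindows(60 * 60))
        # Write windowed text files; file names include window/pane info.
        | "Write" >> fileio.WriteToFiles(
            path="gs://my-bucket/output/",
            sink=lambda destination: fileio.TextSink(),
        )
    )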

1 vote

You should check out Apache Airflow (incubating), a new project donated by Airbnb that lets you schedule workflows; Apache Beam pipelines are supported as well.
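For example, a DAG that submits a Beam pipeline on a schedule could look like this (a minimal sketch, assuming the Apache Beam provider's BeamRunPythonPipelineOperator; operator and argument names may differ across provider versions, and all paths are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

with DAG(
    dag_id="run_beam_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    # Submits the (placeholder) Beam pipeline script to Dataflow.
    BeamRunPythonPipelineOperator(
        task_id="run_pipeline",
        py_file="gs://my-bucket/pipelines/my_pipeline.py",
        runner="DataflowRunner",
        pipeline_options={
            "project": "my-project",
            "region": "us-central1",
            "temp_location": "gs://my-bucket/temp/",
        },
    )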