8
votes

After reading Cloud Dataflow docs, I am still not sure how can I run my dataflow job from App Engine. Is it possible? Is it relevant whether my backend written in Python or in Java? Thanks!

3 Answers

4
votes

Yes, it is possible; you need to use the "Streaming execution" mode, as mentioned here.

Using Google Cloud Pub/Sub as a streaming source, you can use it as the "trigger" of your pipeline.

From App Engine you can perform the "publish" action to the Pub/Sub topic with the REST API.
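As a minimal sketch of that publish call, the snippet below builds the HTTP request for the Pub/Sub v1 `topics:publish` REST method using only the Python standard library. The project and topic names are placeholders, and the access token is assumed to come from the App Engine service account (not shown); note that Pub/Sub requires the message data to be base64-encoded.

```python
import base64
import json
import urllib.request


def build_publish_request(project, topic, payload, access_token):
    """Build an urllib Request for the Pub/Sub v1 publish REST method.

    Pub/Sub message data must be base64-encoded in the JSON body.
    """
    url = ("https://pubsub.googleapis.com/v1/projects/%s/topics/%s:publish"
           % (project, topic))
    body = json.dumps({
        "messages": [
            {"data": base64.b64encode(payload.encode("utf-8")).decode("ascii")}
        ]
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
    )


# Hypothetical names; the token would come from the GAE service account.
req = build_publish_request("my-project", "my-topic",
                            "new data available", "ACCESS_TOKEN")
# urllib.request.urlopen(req)  # uncomment to actually send the request
```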

1
votes

One way would indeed be to use Pub/Sub from within App Engine to let Cloud Dataflow know when new data is available. The Cloud Dataflow job would then run continuously and App Engine would provide the data for processing.

A different approach would be to add the code that sets up the Cloud Dataflow pipeline to a class in App Engine (including the Dataflow SDK in your GAE project) and set the job options programmatically, as explained here:

https://cloud.google.com/dataflow/pipelines/specifying-exec-params

Make sure to set the 'runner' option to DataflowPipelineRunner, so that the job executes asynchronously on Google Cloud Platform. Since the runner that actually executes your pipeline does not have to be the same environment as the code that initiates it, the setup code (up to and including pipeline.run()) can live in App Engine.
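To illustrate the shape of those options, here is a small Python sketch that assembles the argv-style flags the Dataflow SDK parses (in the Java SDK these would be passed to PipelineOptionsFactory.fromArgs(...)). The project, bucket, and job name are hypothetical; the key flag is --runner.

```python
def dataflow_options(project, staging_location, job_name):
    """Assemble argv-style Dataflow pipeline options.

    --runner=DataflowPipelineRunner makes the SDK submit the job to the
    Dataflow service asynchronously instead of blocking the caller,
    which is what you want when triggering from an App Engine request.
    """
    return [
        "--runner=DataflowPipelineRunner",
        "--project=" + project,
        "--stagingLocation=" + staging_location,
        "--jobName=" + job_name,
    ]


# Hypothetical values for illustration:
opts = dataflow_options("my-project", "gs://my-bucket/staging",
                        "gae-triggered-job")
```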

You can then add an endpoint or servlet to GAE that, when called, runs the code that sets up the pipeline.

To take scheduling one step further, you could have a cron job in GAE that calls the endpoint that initiates the pipeline.
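A cron entry for that could look like the following cron.yaml fragment (the /start-pipeline URL is a hypothetical endpoint that runs the pipeline setup code):

```yaml
cron:
- description: kick off the Dataflow pipeline
  url: /start-pipeline
  schedule: every 24 hours
```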

0
votes

There might be a way to submit your Dataflow job from App Engine, but this is not something that's actively supported, as suggested by the lack of docs. App Engine's runtime environment makes some of the required operations more difficult, e.g. obtaining credentials and submitting Dataflow jobs.