4
votes

I have a streaming Dataflow pipeline running that reads from a Pub/Sub subscription.

After a period of time, or perhaps after processing a certain amount of data, I want the pipeline to stop by itself. I don't want my Compute Engine instances to keep running indefinitely.

When I cancel the job through the Dataflow console, it is shown as a failed job.

Is there a way to achieve this? Am I missing something, or is that feature missing from the API?

It almost sounds like you shouldn't be running in streaming mode, but rather in batch. What's your use case that you need to run in streaming mode? – Graham Polley
I have to use streaming mode since my input comes through Pub/Sub. Since the streaming job runs forever, I want to stop it. – Bharathi
It sounds odd to design your application around Pub/Sub and the streaming runner when you want it to stop after X amount of data has been processed; that sounds like classic batch. Anyway, I can't see anything in the API/SDK to cancel the job at the moment. You could just stop/delete the VMs in the pipeline's worker pool; the job would probably fail/cancel then. Would that do the trick? – Graham Polley
We're already considering adding a variation of the Pub/Sub source for use in batch mode, similar to what Bharathi is suggesting ("read for a certain time" or "read a certain amount of data"). It is a valid use case that fits well with Dataflow's idea of unifying streaming and batch. – jkff
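
For illustration, such a bounded read might look like the sketch below. This is purely illustrative: the maxNumRecords and maxReadTime options are assumptions based on the proposal in the comment above, not a confirmed API at the time of writing.

Pipeline p = Pipeline.create(options);
p.apply(PubsubIO.Read
        .subscription("projects/my-project/subscriptions/my-subscription")
        // Assumed options: stop after N records (or, alternatively, after a
        // fixed read time), turning the read into a bounded source that can
        // run as a batch job and shut down its workers when done.
        .maxNumRecords(100000))
 .apply(TextIO.Write.to("gs://my-bucket/output"));
p.run();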

2 Answers

5
votes

Could you do something like this?

Pipeline pipeline = ...;
... (construct the streaming pipeline) ...
final DataflowPipelineJob job =
    DataflowPipelineRunner.fromOptions(pipelineOptions)
                          .run(pipeline);
// Let the job run for the desired time, then cancel it programmatically.
Thread.sleep(timeoutMillis); // your timeout, e.g. 60 * 60 * 1000L for one hour
job.cancel();
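
If you'd rather not cancel blindly after a fixed sleep, you can poll the job state first so you don't try to cancel a job that has already finished or failed. A minimal sketch, building on the job variable above and assuming the SDK's DataflowPipelineJob.getState() API:

long deadlineMillis = System.currentTimeMillis() + 60 * 60 * 1000L;
// Wait until either the deadline passes or the job leaves the RUNNING state.
while (System.currentTimeMillis() < deadlineMillis
    && job.getState() == PipelineResult.State.RUNNING) {
  Thread.sleep(30000); // poll every 30 seconds
}
if (job.getState() == PipelineResult.State.RUNNING) {
  job.cancel(); // cancel discards in-flight data, unlike drain (see the other answer)
}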
0
votes

I was able to drain (cancel a job without losing data) a running streaming job on Dataflow with the REST API.

See my answer

Use the REST update method (projects.jobs.update), with this body:

{ "requestedState": "JOB_STATE_DRAINING" }