I am currently using the stopit library (https://github.com/glenfant/stopit) to set per-element processing timeouts in batch jobs. These jobs run on the direct runner, and I am able to time out functions that take too long.

What is the Beam way of setting a per-element processing timeout for a batch job?

Is there a way to set a processing timeout with a trigger for a Dataflow batch job?

My use case is extracting named entities from text. The NER process sometimes takes too long when the document being processed is very long.

It would be nice to get rid of the stopit dependency and move to a Beam-native solution.

According to the Apache Beam documentation, you can use the wait_until_finish() method with the direct runner or the Dataflow runner. Using this method you can define the amount of time after which the pipeline will time out. Is that what you are looking for? - Alexandre Moraes
I am looking for a per-element processing timeout, not a whole-pipeline timeout. I have updated the question to clarify this. Thanks for the helpful suggestion though! - swartchris8
Can you please give an example of what you are doing and why you are setting a timeout for each element? That way I can understand better and help you. - Alexandre Moraes
I am running an NLP extraction pipeline on each piece of text, and the execution time of the code depends on the text. For particular texts, the NLP library I am using can run very slowly, and I would like to terminate the slow-running extractions and reprocess them on a different machine. - swartchris8

1 Answer

As I understand it, the answer to the question is timely processing, which also maintains state. Say you have a function f and you want its output for every batch that uses it. When a batch arrives, mark it by updating a state, and set a timer whose expiry follows the watermark or an expiry time that you choose. If f produces output before the timer fires, you receive it as usual; if no output has arrived by then, which is the case you describe, you can re-route that batch.

This is not an exact solution, but it can be worked out.

For a better understanding, see the Apache Beam documentation.
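
As a rough illustration of the "re-route" idea only, here is a minimal sketch that tags each element as finished or timed out. It is not Beam timely processing: it swaps in a plain standard-library thread with a join timeout, and `run_with_timeout` and `fake_ner` are hypothetical names invented for this example. The assumption is that the helper would be called from inside a DoFn's process method, with the "timeout" results sent to a separate output for reprocessing.

```python
import threading
import time


def run_with_timeout(fn, element, timeout_s):
    """Run fn(element) in a helper thread.

    Returns ("ok", result) if fn finishes within timeout_s seconds,
    otherwise ("timeout", element) so the caller can re-route the element.
    """
    box = {}

    def target():
        try:
            box["result"] = fn(element)
        except Exception as exc:  # hand worker errors back to the caller
            box["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The thread is abandoned, not killed: Python cannot stop a running thread.
        return ("timeout", element)
    if "error" in box:
        raise box["error"]
    return ("ok", box["result"])


def fake_ner(text):
    # Hypothetical stand-in for the real NER call; long texts are "slow".
    if len(text) > 20:
        time.sleep(0.5)
    return text.split()


results = [
    run_with_timeout(fake_ner, text, timeout_s=0.1)
    for text in ["short text", "x" * 50]
]
```

The slow call keeps running in the background after the timeout, so this only decides where the element goes next; it does not free the CPU. If the work has to actually stop, a process-based variant that can be terminated would be needed.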