1
votes

I don't see any mention of speculative execution in the Apache Beam documentation, but this post claims that Beam has something like it:

"The ParDo transformation is fault-tolerant, i.e. if it crashes, it is rerun. The transformation also has a concept of speculative execution (similar in its basics to speculative execution in Spark). The processing of a given subset of the dataset can be executed on two different workers at the same time. The results from the quicker worker are used and those from the slower one are discarded. It's important to emphasize here that a ParDo implementation must be aware of parallel execution on the same subset of data."
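If I understand the post correctly, the "fastest worker wins" idea is something like this generic sketch (plain Java with CompletableFuture, not the Beam API; the class name and delays are made up for illustration):

```java
import java.util.concurrent.CompletableFuture;

// Not Beam API: a generic "fastest worker wins" sketch of speculative execution.
public class SpeculativeSketch {

    // Pretend this is the work done on one copy of the data subset.
    static int process(long delayMillis) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 42; // both workers compute the same deterministic answer
    }

    // Launch the same work on two "workers"; take whichever finishes first
    // and simply discard the slower result.
    static int speculativeRun() {
        CompletableFuture<Integer> fast = CompletableFuture.supplyAsync(() -> process(10));
        CompletableFuture<Integer> slow = CompletableFuture.supplyAsync(() -> process(200));
        return (Integer) CompletableFuture.anyOf(fast, slow).join();
    }

    public static void main(String[] args) {
        System.out.println(speculativeRun()); // 42
    }
}
```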

Is it true?

2
The most relevant documentation about the execution model is beam.apache.org/documentation/runtime/model. I don't think it's a standard for all runners; it more likely describes what the Dataflow runner does. – Rui Wang

2 Answers

3
votes

I believe that speculative execution is the responsibility of the data processing engine, not of Beam. That said, one of the requirements for a Beam transform is that it be idempotent, because the Beam model provides no guarantees as to the number of times your user code might be invoked or retried (see the transform requirements).
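To make the retry requirement concrete, here is a generic illustration (plain Java, not the Beam API; the class, method names, and sinks are invented for this example) of why side effects in user code must be idempotent when a runner may invoke it more than once per element:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Not Beam API: a generic illustration of why user code must tolerate retries.
public class IdempotencyDemo {

    // Simulates a runner retrying the same element several times and
    // returns {distinct keys written, records appended}.
    static int[] simulate(int retries) {
        Map<String, Integer> store = new HashMap<>(); // idempotent sink: keyed upsert
        List<Integer> log = new ArrayList<>();        // non-idempotent sink: append-only
        for (int attempt = 0; attempt < retries; attempt++) {
            store.put("element-42", 7); // rewriting the same key is harmless
            log.add(7);                 // each retry adds a duplicate record
        }
        return new int[] { store.size(), log.size() };
    }

    public static void main(String[] args) {
        int[] result = simulate(3);
        System.out.println(result[0]); // 1: retries left the keyed store unchanged
        System.out.println(result[1]); // 3: retries duplicated the appended records
    }
}
```

A keyed upsert converges to the same state no matter how many times the code runs, while a blind append duplicates output on every retry.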

0
votes

There is no similar design in Beam. You can look at the documentation here [1], which has a lot of detail on this topic.

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L365