1
votes

I don't see any mention of speculative execution in the Apache Beam documentation, but this post claims that Beam has something like it:

"The ParDo transformation is fault-tolerant, i.e. if it crashes, it is rerun. The transformation also has a concept of speculative execution (similar in its basics to speculative execution in Spark). The processing of a given subset of the dataset can be executed on two different workers at the same time. The results from the quicker worker are used and those from the slower one are discarded. It's important to emphasize here that a ParDo implementation must be aware of parallel execution on the same subset of data."
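If I understand the post correctly, the "fastest worker wins" idea is something like this generic sketch (plain Java with CompletableFuture, not the Beam API; the class name and delays are made up for illustration):

```java
import java.util.concurrent.CompletableFuture;

// Not Beam API: a generic "fastest worker wins" sketch of speculative execution.
public class SpeculativeSketch {

    // Pretend this is the work done on one copy of the data subset.
    static int process(long delayMillis) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 42; // both workers compute the same deterministic answer
    }

    // Launch the same work on two "workers"; take whichever finishes first
    // and simply discard the slower result.
    static int speculativeRun() {
        CompletableFuture<Integer> fast = CompletableFuture.supplyAsync(() -> process(10));
        CompletableFuture<Integer> slow = CompletableFuture.supplyAsync(() -> process(200));
        return (Integer) CompletableFuture.anyOf(fast, slow).join();
    }

    public static void main(String[] args) {
        System.out.println(speculativeRun()); // 42
    }
}
```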

Is it true?

2
The most relevant documentation about the execution model is beam.apache.org/documentation/runtime/model. I don't think it's a standard for all runners; it more likely describes what the Dataflow runner does. – Rui Wang

2 Answers

3
votes

I believe that speculative execution is the responsibility of the data processing engine, not of Beam. That said, one of the requirements for a Beam transform is that it be idempotent, because the Beam model provides no guarantees as to the number of times your user code might be invoked or retried (see the transform requirements).
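To make the retry requirement concrete, here is a generic illustration (plain Java, not the Beam API; the class, method names, and sinks are invented for this example) of why side effects in user code must be idempotent when a runner may invoke it more than once per element:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Not Beam API: a generic illustration of why user code must tolerate retries.
public class IdempotencyDemo {

    // Simulates a runner retrying the same element several times and
    // returns {distinct keys written, records appended}.
    static int[] simulate(int retries) {
        Map<String, Integer> store = new HashMap<>(); // idempotent sink: keyed upsert
        List<Integer> log = new ArrayList<>();        // non-idempotent sink: append-only
        for (int attempt = 0; attempt < retries; attempt++) {
            store.put("element-42", 7); // rewriting the same key is harmless
            log.add(7);                 // each retry adds a duplicate record
        }
        return new int[] { store.size(), log.size() };
    }

    public static void main(String[] args) {
        int[] result = simulate(3);
        System.out.println(result[0]); // 1: retries left the keyed store unchanged
        System.out.println(result[1]); // 3: retries duplicated the appended records
    }
}
```

A keyed upsert converges to the same state no matter how many times the code runs, while a blind append duplicates output on every retry.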

0
votes

There is no similar design in Beam. You can look at the documentation here [1], which has a lot of detail on this topic.

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L365