I am new to big data processing. I am using the Apache Beam Java SDK to work with it, and I am trying to understand how multithreading/parallel data processing works in an Apache Beam pipeline. How is data processed from one PTransform to another with respect to multithreading?
1 Answer
If you are talking about parallel data processing then, generally speaking, Beam relies on the data processing engine underneath (e.g. Spark, Flink, Dataflow, etc.). Usually, you can't directly control things like how many workers will be used or how your input PCollection
will be chunked and parallelised - that is the responsibility of the engine you use.
However, it's assumed that the input data will be split into bundles, and that every instance of your DoFn
in the pipeline will process one or more bundles on a worker, with different bundles potentially handled by different workers at the same time. In this way, data can be processed in parallel - every single bundle on every single worker. There is no coordination or synchronisation mechanism among bundles (as we have with ordinary multithreading) - we have to assume that they are processed independently and in arbitrary order.
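To make the bundle model concrete, here is a minimal sketch in plain Java with no Beam dependency. The `processInBundles` helper and the fixed bundle size are my own illustration, not Beam API: the input is chunked into bundles (normally the runner decides how), and each bundle is processed independently on a thread pool standing in for the runner's workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BundleDemo {

    // Split the input into bundles and hand each bundle to a worker thread.
    // Bundles share no state and need no coordination - just like Beam bundles.
    static List<Integer> processInBundles(List<Integer> elements,
                                          int bundleSize,
                                          int numWorkers) throws Exception {
        // The "engine" decides how to chunk the input; a fixed bundle
        // size is used here purely for illustration.
        List<List<Integer>> bundles = new ArrayList<>();
        for (int i = 0; i < elements.size(); i += bundleSize) {
            bundles.add(elements.subList(i, Math.min(i + bundleSize, elements.size())));
        }

        // "Workers" are plain threads here; in Beam they would be
        // worker processes managed by the runner.
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        List<Future<List<Integer>>> futures = new ArrayList<>();
        for (List<Integer> bundle : bundles) {
            futures.add(workers.submit(() -> {
                List<Integer> out = new ArrayList<>();
                for (int x : bundle) {
                    out.add(x * 10); // stand-in for the per-element DoFn logic
                }
                return out;
            }));
        }

        // Gather the per-bundle results (here, in submission order).
        List<Integer> output = new ArrayList<>();
        for (Future<List<Integer>> f : futures) {
            output.addAll(f.get());
        }
        workers.shutdown();
        return output;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> result = processInBundles(List.of(1, 2, 3, 4, 5, 6), 2, 3);
        System.out.println(result); // prints [10, 20, 30, 40, 50, 60]
    }
}
```

Note that each bundle's elements are processed by a single task, but nothing is shared across bundles - which is exactly why a DoFn should not rely on state carried over between bundles.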
So, this is a very bird's-eye view. If you have any specific question - don't hesitate to ask.