I am new to big data processing. I am using the Apache Beam Java SDK to work with it, and I am trying to understand how multithreading/parallel data processing works in an Apache Beam pipeline. How is data processed from one PTransform to another with respect to multithreading?
1 Answer
If you are talking about parallel data processing then, generally speaking, Beam relies on the data processing engine underneath (e.g. Spark, Flink, Dataflow, etc.). Usually, you can't directly control things like how many workers will be used or how your input PCollection
will be chunked and parallelised - that is the responsibility of the engine you use.
However, it's assumed that the input data will be split into bundles, and that every instance of your DoFn
in the pipeline will process one or more bundles on a worker, with different bundles potentially handled by different workers at the same time. In this way, data can be processed in parallel - every single bundle on every single worker. There is no coordination or synchronisation mechanism among bundles (as we have with ordinary multithreading) - we have to assume that they are processed independently and in arbitrary order.
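To make the bundle model concrete, here is a minimal sketch in plain Java with no Beam dependency. The `processInBundles` helper and the fixed bundle size are my own illustration, not Beam API: the input is chunked into bundles (normally the runner decides how), and each bundle is processed independently on a thread pool standing in for the runner's workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BundleDemo {

    // Split the input into bundles and hand each bundle to a worker thread.
    // Bundles share no state and need no coordination - just like Beam bundles.
    static List<Integer> processInBundles(List<Integer> elements,
                                          int bundleSize,
                                          int numWorkers) throws Exception {
        // The "engine" decides how to chunk the input; a fixed bundle
        // size is used here purely for illustration.
        List<List<Integer>> bundles = new ArrayList<>();
        for (int i = 0; i < elements.size(); i += bundleSize) {
            bundles.add(elements.subList(i, Math.min(i + bundleSize, elements.size())));
        }

        // "Workers" are plain threads here; in Beam they would be
        // worker processes managed by the runner.
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        List<Future<List<Integer>>> futures = new ArrayList<>();
        for (List<Integer> bundle : bundles) {
            futures.add(workers.submit(() -> {
                List<Integer> out = new ArrayList<>();
                for (int x : bundle) {
                    out.add(x * 10); // stand-in for the per-element DoFn logic
                }
                return out;
            }));
        }

        // Gather the per-bundle results (here, in submission order).
        List<Integer> output = new ArrayList<>();
        for (Future<List<Integer>> f : futures) {
            output.addAll(f.get());
        }
        workers.shutdown();
        return output;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> result = processInBundles(List.of(1, 2, 3, 4, 5, 6), 2, 3);
        System.out.println(result); // prints [10, 20, 30, 40, 50, 60]
    }
}
```

Note that each bundle's elements are processed by a single task, but nothing is shared across bundles - which is exactly why a DoFn should not rely on state carried over between bundles.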
So, this is a very bird's-eye view. If you have any specific question - don't hesitate to ask.