May I know the exact difference between a ParDo and a Combine transformation in Apache Beam?
Can I see ParDo as the Map phase in the map/shuffle/reduce while Combine as the reduce phase?
Thank you!
MapReduce is limited to graphs of the shape Map-Shuffle-Reduce, where Reduce is an elementwise operation, just like map, that is distinguished only by following the shuffle.
In Apache Beam, one can have arbitrary topologies, e.g.
Map-Map-Shuffle-Map-Shuffle-Map-Map-Shuffle-Map
so the notion of breaking phases down by that which follows shuffle no longer holds. (Beam calls Map/Shuffle ParDo and GroupByKey respectively.)
Combine operations are a special kind of Map operations that are known to be associative (think sum
, max
, etc. but they can be much more complicated) which allow us to push part of the work before the shuffle, e.g.
Shuffle-Sum
becomes
PartialSum-Shuffle-Sum
(Most MapReduce systems also have this notion, named combining or semi-reducing or similar.)
Note that Beam's CombinePerKey and GlobalCombine operations pair the shuffle with the CombineFn, no need to GroupByKey first.
As far as I have understand Apache Beam, there are no explicit Map and Reduce phases.
You can apply several element-wise map functions in a row, where ParDo
is the most general class that can be used for own implementation.
The term reduce has been replaced by aggregation and there the corresponding class is Combine
.