2
votes

May I know the exact difference between a ParDo and a Combine transformation in Apache Beam?

Can I see ParDo as the Map phase in the map/shuffle/reduce while Combine as the reduce phase?

Thank you!

2

2 Answers

1
votes

MapReduce is limited to graphs of the shape Map-Shuffle-Reduce, where Reduce is an elementwise operation, just like map, that is distinguished only by following the shuffle.

In Apache Beam, one can have arbitrary topologies, e.g.

Map-Map-Shuffle-Map-Shuffle-Map-Map-Shuffle-Map

so the notion of breaking phases down by that which follows shuffle no longer holds. (Beam calls Map/Shuffle ParDo and GroupByKey respectively.)

Combine operations are a special kind of Map operations that are known to be associative (think sum, max, etc. but they can be much more complicated) which allow us to push part of the work before the shuffle, e.g.

Shuffle-Sum

becomes

PartialSum-Shuffle-Sum

(Most MapReduce systems also have this notion, named combining or semi-reducing or similar.)

Note that Beam's CombinePerKey and GlobalCombine operations pair the shuffle with the CombineFn, no need to GroupByKey first.

1
votes

As far as I have understand Apache Beam, there are no explicit Map and Reduce phases.

You can apply several element-wise map functions in a row, where ParDo is the most general class that can be used for own implementation.

The term reduce has been replaced by aggregation and there the corresponding class is Combine.