Combine vs ParDo in apache beam

Question

May I know the exact difference between a ParDo and a Combine transformation in Apache Beam?

Can I see ParDo as the Map phase in the map/shuffle/reduce while Combine as the reduce phase?

Thank you!

robertwb robertwb · Accepted Answer · 2020-08-31T18:21:13

MapReduce is limited to graphs of the shape Map-Shuffle-Reduce, where Reduce is an elementwise operation, just like map, that is distinguished only by following the shuffle.

In Apache Beam, one can have arbitrary topologies, e.g.

Map-Map-Shuffle-Map-Shuffle-Map-Map-Shuffle-Map

so the notion of breaking phases down by that which follows shuffle no longer holds. (Beam calls Map/Shuffle ParDo and GroupByKey respectively.)

Combine operations are a special kind of Map operations that are known to be associative (think sum, max, etc. but they can be much more complicated) which allow us to push part of the work before the shuffle, e.g.

Shuffle-Sum

becomes

PartialSum-Shuffle-Sum

(Most MapReduce systems also have this notion, named combining or semi-reducing or similar.)

Note that Beam's CombinePerKey and GlobalCombine operations pair the shuffle with the CombineFn, no need to GroupByKey first.

Combine vs ParDo in apache beam

2 Answers