
Both DoFn and PTransform are ways to define an operation on a PCollection. How do we know which one to use, and when?


1 Answer


A simple way to understand it is by analogy with map(f) for lists:

  • The higher-order function map applies a function to each element of a list, returning a new list of the results. You might call it a computational pattern.
  • The function f is the logic applied to each element.
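To make the analogy concrete, here is a minimal sketch in plain Java with no Beam involved. The class name `MapAnalogy` and the helper `map` are made up for illustration: `map` is the pattern, and `square` is the `f` that the pattern applies to each element.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration class; nothing here is part of Beam.
class MapAnalogy {
    // The pattern: apply f to every element and collect the results into a new list.
    static <A, B> List<B> map(Function<A, B> f, List<A> xs) {
        return xs.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The logic: what to do with one element.
        Function<Integer, Integer> square = x -> x * x;
        System.out.println(map(square, List.of(1, 2, 3))); // prints [1, 4, 9]
    }
}
```

Note that `square` knows nothing about lists, and `map` knows nothing about squaring. That separation is the point of the analogy.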

Now, switching to Beam specifics, I think you are asking about ParDo.of(fn), which is a PTransform.

  • A PTransform is an operation that takes PCollections as input and yields PCollections as output. Beam has just five primitive types of PTransform, encapsulating embarrassingly parallel computational patterns.
  • ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question.
  • The DoFn, which I called fn here, is the logic that is applied to each element.
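The relationship between the three bullets above can be sketched as a toy model in plain Java. This is not the Beam API: `ToyBeam`, `ElementLogic`, `Transform`, and `perElement` are hypothetical stand-ins for `DoFn`, `PTransform`, and `ParDo.of(fn)`, and a `List` stands in for a `PCollection`. The one Beam detail it tries to mirror is that per-element logic may emit zero or more outputs for each input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy stand-ins only; every name here is hypothetical and not part of Beam.
class ToyBeam {
    // Plays the role of a DoFn: logic for ONE element, emitting zero or more outputs.
    interface ElementLogic<I, O> {
        void process(I element, Consumer<O> out);
    }

    // Plays the role of a PTransform: a whole collection in, a whole collection out.
    interface Transform<I, O> {
        List<O> expand(List<I> input);
    }

    // Plays the role of ParDo.of(fn): wraps per-element logic into a Transform.
    static <I, O> Transform<I, O> perElement(ElementLogic<I, O> logic) {
        return input -> {
            List<O> output = new ArrayList<>();
            for (I element : input) {
                logic.process(element, output::add);
            }
            return output;
        };
    }

    public static void main(String[] args) {
        // The "DoFn": split one line into words.
        ElementLogic<String, String> splitWords = (line, out) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    out.accept(word);
                }
            }
        };
        // The "ParDo.of(fn)": a transform over the whole collection.
        Transform<String, String> transform = perElement(splitWords);
        System.out.println(transform.expand(List.of("to be", "or not"))); // prints [to, be, or, not]
    }
}
```

In this sketch the loop inside `perElement` runs sequentially. A real runner is free to distribute that work, which is why the per-element logic should not depend on how or where it is invoked.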

It may also help to remember that you write a DoFn to say what to do with each element, and the Beam runner provides the ParDo that applies your logic across the PCollection.