Apache Beam: DoFn vs PTransform

Question

Both DoFn and PTransform are ways to define an operation on a PCollection. How do we know which one to use, and when?

Kenn Knowles · Accepted Answer · 2017-12-08T03:48:29

A simple way to understand it is by analogy with map(f) for lists:

The higher-order function map applies a function to each element of a list, returning a new list of the results. You might call it a computational pattern.
The function f is the logic applied to each element.
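
As a rough illustration of the analogy in plain Python (the function f and the sample list are made up for this sketch, not taken from Beam):

```python
def f(x):
    # The logic: what to do with a single element.
    return x * x


numbers = [1, 2, 3]

# map is the pattern: apply f to every element and collect the results.
squares = list(map(f, numbers))
print(squares)
```

Swapping in a different f changes what happens to each element; the pattern of visiting every element stays the same.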

Turning to Beam specifics, I think you are asking about ParDo.of(fn), which is a PTransform.

A PTransform is an operation that takes PCollections as input and yields PCollections as output. Beam has just five primitive types of PTransform, encapsulating embarrassingly parallel computational patterns.
ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question.
The DoFn, which I called fn here, is the logic that is applied to each element.

It may also help to think of it this way: you write a DoFn to say what to do with each element, and the Beam runner provides the ParDo to apply your logic.
