Do the core transforms (Map, Filter, Flatten) in Apache Beam use parallel processing to process data elements? If yes, when should we use the ParDo transform specifically?
2 Answers
Beam implements the MapReduce concept. All the "map"-style operations, meaning operations that can be applied to each element independently (Filter, Map, ...), can be run in parallel (on the same worker with different threads, or across different workers).
All the "reduce"-style operations, which need to compare a set of values (a PCollection, or a grouped subset of it) against each other, are performed on the same worker/thread.
So, use ParDo when you perform a unitary operation, i.e. one that works on a single element of the PCollection at a time.
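To illustrate the distinction (in plain Python, not the Beam SDK): a unitary "map"-style step can be farmed out to a thread pool because each element is independent, while a "reduce"-style step has to see the whole collection in one place. The `to_upper` function here is a made-up example, not part of any library:

```python
from concurrent.futures import ThreadPoolExecutor

# A unitary ("map"-style) operation: each element is processed
# independently, so the work can be spread across threads or machines.
def to_upper(word):
    return word.upper()

words = ["red", "green", "blue"]

with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(to_upper, words))  # runs in parallel, order preserved

# A "reduce"-style operation: it must compare all the values together,
# so it runs in one place after the parallel step finishes.
longest = max(mapped, key=len)

print(mapped)   # ['RED', 'GREEN', 'BLUE']
print(longest)  # 'GREEN'
```

This is exactly why Beam can scale map-style transforms freely but has to shuffle data to a single point for grouping and combining.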
I will refer you to the apache_beam docs.
In simple terms, you use ParDo when you have a user-defined function that you want to apply to every element of a PCollection. For example, say you want to split every sentence in a paragraph into single words. You would want to apply a split() function, but split() is not one of the core Beam transforms, so ParDo lets you wrap it and run it element by element.