Do the core transforms (Map, Filter, Flatten) in Apache Beam use parallel processing to process data elements? If yes, when should we use the ParDo transform specifically?
2 Answers
Beam implements the MapReduce concept. All the "map"-style operations, meaning operations that can be applied to each element independently (Filter, Map, ...), can be run in parallel (on the same worker with different threads, or across different workers).
All the "reduce"-style operations, which need to compare a set of values (a PCollection, or a grouped subset of it) against each other, are performed on the same worker/thread.
So, use ParDo when you perform a unitary operation, i.e. one that works on a single element of the PCollection at a time.
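To illustrate the distinction (in plain Python, not the Beam SDK): a unitary "map"-style step can be farmed out to a thread pool because each element is independent, while a "reduce"-style step has to see the whole collection in one place. The `to_upper` function here is a made-up example, not part of any library:

```python
from concurrent.futures import ThreadPoolExecutor

# A unitary ("map"-style) operation: each element is processed
# independently, so the work can be spread across threads or machines.
def to_upper(word):
    return word.upper()

words = ["red", "green", "blue"]

with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(to_upper, words))  # runs in parallel, order preserved

# A "reduce"-style operation: it must compare all the values together,
# so it runs in one place after the parallel step finishes.
longest = max(mapped, key=len)

print(mapped)   # ['RED', 'GREEN', 'BLUE']
print(longest)  # 'GREEN'
```

This is exactly why Beam can scale map-style transforms freely but has to shuffle data to a single point for grouping and combining.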
I will refer you to the apache_beam docs.
In simple terms, you use ParDo when you have a user-defined function that you want to apply to every element of a PCollection. For example, say you want to split every sentence in a paragraph into single words. You would want to apply a split() function, but split() is not one of the core Beam transforms, so ParDo lets you wrap it and run it element by element.