I have some working knowledge of Python but am pretty new to Apache Beam. I came across an Apache Beam example of a simple word count program. The snippet that confuses me looks like this:
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    with beam.Pipeline(options=pipeline_options) as p:
        # Read the text file[pattern] into a PCollection.
        lines = p | ReadFromText(known_args.input)
        # Count the occurrences of each word.
        counts = (
            lines
            | 'Split' >> (
                beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)).
                with_output_types(unicode))
            | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum))
        # Format the counts into a PCollection of strings.
        def format_result(word_count):
            (word, count) = word_count
            return '%s: %s' % (word, count)
        output = counts | 'Format' >> beam.Map(format_result)
        # Write the output using a "Write" transform that has side effects.
        # pylint: disable=expression-not-assigned
        output | WriteToText(known_args.output)
The full version of the code is here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py
I'm very confused by the "|" and ">>" operators used here. What do they mean here? Are they natively supported in Python?
Pipeline objects have overloaded operators that do object-specific things. They are described in the documentation (which has substantial room for improvement IMHO). - Klaus D.
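To make the comment above concrete: `|` (bitwise OR) and `>>` (right shift) are ordinary Python operators, and any class can redefine them through special methods such as `__or__` and `__rrshift__`. The sketch below is a toy model of that mechanism, not Beam's actual implementation. The names `Transform`, `Collection`, `flat_map`, and `map_each` are hypothetical stand-ins invented for illustration. As far as I know, Beam follows the same general idea, with `'label' >> transform` attaching a name to the transform and `pcollection | transform` applying it.

```python
import re


class Transform:
    """Toy stand-in for a transform (hypothetical, not Beam's PTransform)."""

    def __init__(self, fn, label=None):
        self.fn = fn          # function applied to the whole list of items
        self.label = label    # optional human-readable step name

    def __rrshift__(self, label):
        # Python calls this for `label >> transform`, because the left
        # operand (a str) does not implement `>>` itself.
        return Transform(self.fn, label)


class Collection:
    """Toy stand-in for a collection of elements (hypothetical)."""

    def __init__(self, items, applied=()):
        self.items = list(items)
        self.applied = list(applied)  # labels of the steps applied so far

    def __or__(self, transform):
        # Python calls this for `collection | transform`.
        return Collection(transform.fn(self.items),
                          self.applied + [transform.label])


def flat_map(fn):
    # Each input element may produce zero or more output elements.
    return Transform(lambda items: [out for item in items for out in fn(item)])


def map_each(fn):
    # Each input element produces exactly one output element.
    return Transform(lambda items: [fn(item) for item in items])


lines = Collection(["the cat", "the dog"])

# `>>` binds more tightly than `|`, so this parses as
# (lines | ('Split' >> ...)) | ('PairWithOne' >> ...).
words = (
    lines
    | 'Split' >> flat_map(lambda x: re.findall(r"[A-Za-z']+", x))
    | 'PairWithOne' >> map_each(lambda w: (w, 1)))

print(words.applied)  # ['Split', 'PairWithOne']
print(words.items)    # [('the', 1), ('cat', 1), ('the', 1), ('dog', 1)]
```

So the operators are native Python syntax. Only their meaning is library-specific, which is why the chained expression reads like a shell pipeline rather than integer arithmetic.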