Understanding “|” and “>>” from the Beam Python example

Question

I have some working knowledge about Python but pretty new to Apache Beam. I have encountered an example from Apache Beam about a simple word count program. The snippet that I'm confused looks like this:

  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)

    # Count the occurrences of each word.
    counts = (
        lines
        | 'Split' >> (
            beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)).
            with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word_count):
      (word, count) = word_count
      return '%s: %s' % (word, count)

    output = counts | 'Format' >> beam.Map(format_result)

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | WriteToText(known_args.output)

The full version of the code is here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py

I'm very confused by the "|" and ">>" operators used here. What do they mean here? Are they natively supported in Python?

Pipeline objects have overwritten operators which do objects specific stuff. They are described in the documentation (which has substantial room for improvement IMHO). — Klaus D.
Thanks for linking the resources. I didn't think of operator overloading and thought this was something in Python idioms that I didn't know.. — Archer

srishtigarg srishtigarg · Accepted Answer · 2020-12-31T08:38:35

Since this code is written in Beam, the symbols you are talking about are native to Beam Pipeline. | is the pipeline symbol which indicates the pipeline being addressed to for the given operation: Like in your example, p is the source pipeline for lines = p | ReadFromText(known_args.input) and lines is the source pipeline for

counts = (
        lines
        | 'Split' >> (
            beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)).
            with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

>> gives a name to a certain operation for ease of reading on the UI.

In your example, 'GroupAndSum' >> beam.CombinePerKey(sum)), GroupAndSum is the name of the combine operation and so on.

Read the documentation given by @Klaus D. in the comments for more clarity.

Understanding “|” and “>>” from the Beam Python example

1 Answers