49
votes

I have read through the Beam documentation and also looked through Python documentation but haven't found a good explanation of the syntax being used in most of the example Apache Beam code.

Can anyone explain what the _ , | , and >> are doing in the below code? Also is the text in quotes ie 'ReadTrainingData' meaningful or could it be exchanged with any other label? In other words how is that label being used?

train_data = pipeline | 'ReadTrainingData' >> _ReadData(training_data)
evaluate_data = pipeline | 'ReadEvalData' >> _ReadData(eval_data)

input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema)

_ = (input_metadata
| 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
       os.path.join(output_dir, path_constants.RAW_METADATA_DIR),
       pipeline=pipeline))

preprocessing_fn = reddit.make_preprocessing_fn(frequency_threshold)
(train_dataset, train_metadata), transform_fn = (
  (train_data, input_metadata)
  | 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset(
      preprocessing_fn))
1
The idea is to take a .method() syntax and turn it into an infix operator. For example taking something like a.plus(b) and making it possible to write syntax like a + b . Check out the __or__ , __ror__ , and __rrshift__ function definitions in the source github.com/apache/beam/blob/master/sdks/python/apache_beam/…. So MyPTransform | NextPTransform is really taking both of those PTransform objects and passing them as a list _parts to a new _ChainedPTransform object, and if you __or__ yet another PTransform, it flattens the nested lists.Davos
>> aka __rrshift__ is effectively setting the label attribute of the PTransform but it doesn't just do something like Ptransform.label('new name') or Ptransform.label = 'new name', it seems more convoluted to me. 'ReadEvalData' >> _ReadData(eval_data) is evaluated as returning a new _NamedPTransform object with _ReadData(eval_data) initialized as the self.transform attribute and the string 'ReadEvalData' is initialized as the label attribute by using Super to run the parent class PTransform's init method.Davos
__ror__ is the same as __or__, they both end up as the pipe infix | but __ror__ lets you define how to pipe other objects to a PTransform where those other objects don't have the pipe operator method defined. It kind of makes the pipe operator of the thing on the right still work from left-to-right.Davos
Extremely convoluted, creating a new object with the current object as an attribute, just to get a functional / shell style flow going. It does make it easier to read the processing pipeline overall from left to right though, and looks similar to functional programming syntax like F# or using MagrittR in Rlang. It is probably better than a lot of nested function calls like PTransforrm.apply(NextPTransform.apply(YetAnotherPTransform)) but then you can always create new variables for each step. After all, it is lazily evaluated and it is not doing deep copies of data so there's no penalty.Davos

1 Answers

67
votes

Operators in Python can be overloaded. In Beam, | is a synonym for apply, which applies a PTransform to a PCollection to produce a new PCollection. >> allows you to name a step for easier display in various UIs -- the string between the | and the >> is only used for these display purposes and identifying that particular application.

See https://beam.apache.org/documentation/programming-guide/#transforms