How to use Google DataFlow Runner and Templates in tf.Transform?

Question

We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.

We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears possible to choose the PipelineRunner for a pipeline that includes tensorflow_transform.beam.impl AnalyzeAndTransformDataset, as follows:

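import apache_beam as beam
from tensorflow_transform.beam import impl as tft
# The runner is selected by name when the pipeline is created.
pipeline_name = "DirectRunner"
p = beam.Pipeline(pipeline_name)
# xxx and yyy stand in for the upstream transforms.
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)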

TemplatingDataflowPipelineRunner lets us separate preprocessing development from parameterized operations (see https://cloud.google.com/dataflow/docs/templates/overview). The steps are basically:

A) In PipelineOptions-derived types, change option types to ValueProvider (Python way: type inference or type hints?)
B) Change the runner to TemplatingDataflowPipelineRunner
C) Run mvn archetype:generate to store the template in GCS (Python way: a YAML file like TF Hypertune?)
D) gcloud beta dataflow jobs run --gcs-location --parameters

The question is: can you show me how we can use tf.Transform with TemplatingDataflowPipelineRunner?

María García Herrero · Accepted Answer · 2017-03-21T23:55:31

Python templates are available as of April 2017 (see documentation). They work as follows:

Define UserOptions subclassed from PipelineOptions.
Use the add_value_provider_argument API to add specific arguments to be parameterized.
Regular non-parameterizable options will continue to be defined using argparse's add_argument.

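from apache_beam.options.pipeline_options import PipelineOptions

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Parameterized: resolved when a job is launched from the template.
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        # Not parameterized: fixed when the template is created.
        parser.add_argument('--non_value_provider_arg', default='some_other_value')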

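Note that Python doesn't have a TemplatingDataflowPipelineRunner, and neither does Java 2.X (unlike Java 1.X). In Python, a template is instead staged by running the pipeline with DataflowRunner and the --template_location option pointing at a GCS path.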
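
To illustrate why a parameterized option is read lazily (via get()) inside a transform rather than while the pipeline is being built, here is a toy model that uses only the standard library. It is a sketch of the idea, not Beam's implementation: ToyStaticValue, ToyRuntimeValue and append_suffix are hypothetical stand-ins, and only the method names get() and is_accessible() mirror Beam's ValueProvider interface.

```python
# Toy model of deferred ("value provider") options; standard library only.
# ToyStaticValue and ToyRuntimeValue are hypothetical stand-ins, not Beam classes.


class ToyStaticValue(object):
    """A value that is already known when the pipeline graph is built."""

    def __init__(self, value):
        self._value = value

    def is_accessible(self):
        return True

    def get(self):
        return self._value


class ToyRuntimeValue(object):
    """A placeholder whose value is supplied only when a job is launched."""

    # Shared by all placeholders; set once the job's parameters are known.
    runtime_options = None

    def __init__(self, name, default=None):
        self._name = name
        self._default = default

    def is_accessible(self):
        return ToyRuntimeValue.runtime_options is not None

    def get(self):
        if not self.is_accessible():
            raise RuntimeError(
                "%s is not available while the graph is being built" % self._name)
        # Fall back to the declared default if the launcher did not pass a value.
        return ToyRuntimeValue.runtime_options.get(self._name, self._default)


def append_suffix(elements, suffix):
    """Stand-in for a transform: resolves the option only when it runs."""
    return [element + suffix.get() for element in elements]
```

In this sketch, a ToyRuntimeValue can be passed around freely while the template is being created, but calling get() on it at that stage raises. Once runtime_options is set (the analogue of passing parameters when launching a job from the template), the same object resolves to the supplied value or to its default.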
