1
votes

We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.

We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears that a pipeline using tensorflow_transform.beam.impl AnalyzeAndTransformDataset can specify which PipelineRunner to use, as follows:

import apache_beam as beam
from tensorflow_transform.beam import impl as tft

pipeline_name = "DirectRunner"  # name of the PipelineRunner to use
p = beam.Pipeline(pipeline_name)
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)

TemplatingDataflowPipelineRunner provides the ability to separate our preprocessing development from parameterized operations - see here: https://cloud.google.com/dataflow/docs/templates/overview - basically:

  • A) in PipelineOptions derived types, change option types to ValueProvider (python way: type inference or type hints ???)
  • B) change runner to TemplatingDataflowPipelineRunner
  • C) mvn archetype:generate to store template in GCS (python way: a yaml file like TF Hypertune ???)
  • D) gcloud beta dataflow jobs run --gcs-location --parameters

The question is: can you show me how we can use tf.Transform with TemplatingDataflowPipelineRunner?


2 Answers

6
votes

Python templates are available as of April 2017 (see documentation). They work as follows:

  • Define UserOptions subclassed from PipelineOptions.
  • Use the add_value_provider_argument API to add specific arguments to be parameterized.
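  • Regular non-parameterizable options will continue to be defined using argparse's add_argument.

from apache_beam.options.pipeline_options import PipelineOptions

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Parameterized at template run time (exposed as a ValueProvider).
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        # Fixed when the template is created.
        parser.add_argument('--non_value_provider_arg', default='some_other_value')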

Note that the Python SDK doesn't have a TemplatingDataflowPipelineRunner, and neither does the Java 2.x SDK (unlike the Java 1.x SDK).
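To make the distinction between the two kinds of arguments more concrete, here is a rough sketch of the idea behind a value-provider argument: the pipeline is built once, when the template is created, and the parameterized value is only read when a job is launched. This is plain Python and does not use Beam at all. DeferredOption, AddSuffix and launch are hypothetical stand-ins invented for illustration; in Beam the corresponding object is a ValueProvider, whose value is read with get() inside the processing code rather than at construction time.

```python
# Stand-in classes for illustration only. These are NOT Apache Beam's
# implementation, just a sketch of the deferred-value idea.


class DeferredOption:
    """Hypothetical placeholder for an option whose value arrives at run time."""

    # Filled in while a job launched from the template runs (None otherwise).
    runtime_values = None

    def __init__(self, name, default=None):
        self.name = name
        self.default = default

    def is_accessible(self):
        return DeferredOption.runtime_values is not None

    def get(self):
        if not self.is_accessible():
            raise RuntimeError(
                "%s is not available while the pipeline is being built" % self.name)
        return DeferredOption.runtime_values.get(self.name, self.default)


class AddSuffix:
    """Hypothetical per-element step that keeps the option unresolved."""

    def __init__(self, suffix_option):
        # Store the placeholder itself; do not call get() here.
        self.suffix_option = suffix_option

    def process(self, element):
        # Resolve the value only when an element is actually processed.
        return element + self.suffix_option.get()


def launch(step, elements, parameters):
    """Simulates launching a job from a staged template with parameters."""
    DeferredOption.runtime_values = dict(parameters)
    try:
        return [step.process(e) for e in elements]
    finally:
        DeferredOption.runtime_values = None


# "Template creation": the step is built once, before any value is known.
step = AddSuffix(DeferredOption("value_provider_arg", default="some_value"))

# "Job launch": the same step runs with different parameters each time.
first_run = launch(step, ["a", "b"], {"value_provider_arg": "-x"})
second_run = launch(step, ["a", "b"], {})
```

The point of the sketch is that calling get() while the pipeline is still being constructed fails, which is why any library sitting between your options and the pipeline has to pass the placeholder through untouched instead of reading its value early.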

1
votes

Unfortunately, Python pipelines cannot be used as templates. Templates are only available for Java today. Since you need to use the Python library, this will not be feasible.

tensorflow_transform would also need to support ValueProvider so that you can pass in options as a value provider type through it.