We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.
We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears possible to tell tensorflow_transform.beam.impl AnalyzeAndTransformDataset which PipelineRunner to use, as follows:
import apache_beam as beam
from tensorflow_transform.beam import impl as tft

runner_name = "DirectRunner"  # or "DataflowRunner"
p = beam.Pipeline(runner_name)
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)
TemplatingDataflowPipelineRunner provides the ability to separate our preprocessing development from its parameterized execution - see here: https://cloud.google.com/dataflow/docs/templates/overview - basically:
- A) in PipelineOptions-derived types, change option types to ValueProvider (what is the Python way here - type inference or type hints? see the sketch after this list)
- B) change runner to TemplatingDataflowPipelineRunner
- C) mvn archetype:generate to stage the template in GCS (what is the Python way - a yaml file like TF Hypertune?)
- D) gcloud beta dataflow jobs run --gcs-location --parameters
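For C) and D), our current understanding (which may be wrong) is that the Python SDK has no TemplatingDataflowPipelineRunner class; instead a template is staged by passing --template_location to the normal DataflowRunner, and the parameters are declared as ValueProvider arguments. A minimal sketch of that, with invented option names (input_path, output_path) and placeholder project/bucket values:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class PreprocessOptions(PipelineOptions):
    # ValueProvider arguments stay unresolved until the template is executed.
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument("--input_path", type=str)
        parser.add_value_provider_argument("--output_path", type=str)

options = PreprocessOptions()
with beam.Pipeline(options=options) as p:
    _ = (p
         | "ReadInput" >> beam.io.ReadFromText(options.input_path)
         | "WriteOutput" >> beam.io.WriteToText(options.output_path))

Staging the template would then look something like:

python preprocess.py --runner DataflowRunner --project my-gcp-project --temp_location gs://my-bucket/tmp --template_location gs://my-bucket/templates/preprocess

and running it:

gcloud beta dataflow jobs run preprocess-job --gcs-location gs://my-bucket/templates/preprocess --parameters input_path=gs://my-bucket/in.txt,output_path=gs://my-bucket/out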
The question is: Can you show me how we can use tf.Transform to leverage TemplatingDataflowPipelineRunner?