Goal
My goal is to create a Dataflow template for an Apache Beam pipeline. The pipeline runs in batch mode, reads from BigQuery, applies transforms, and writes the results elsewhere. Most importantly, the query used to read from BigQuery must be provided at runtime.
Expected Behavior
I expect the pipeline to take the BigQuery query from the runtime parameter, execute that query, and then proceed with the rest of the pipeline.
Actual Behavior
The runtime parameter I pass in is ignored. The pipeline instead uses the value I had to specify when creating the template in GCS.
Relevant Code
Below is how I define the query parameter and pass it to the read operation.
public interface MyOptions extends PipelineOptions, StreamingOptions {
    @Description("Query String")
    ValueProvider<String> getQueryString();
    void setQueryString(ValueProvider<String> value);
}
public static void main(String[] args) {
    MyOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(MyOptions.class);
    Pipeline p = Pipeline.create(options);
    // Read rows using the query held in the ValueProvider option.
    PCollection<TableRow> tableRows =
        p.apply(BigQueryIO.readTableRows()
            .fromQuery(options.getQueryString())
            .withTemplateCompatibility()
            .withoutValidation());
    // At this point I run my transformations and loading
}
To build the template and push it to GCS, I run the following:
mvn compile -Pdataflow-runner exec:java -Dexec.mainClass=com.Pipeline "-Dexec.args=--runner=DataflowRunner --queryString='SELECT time,type FROM [my-project:timeseries.my-data] where time between TIMESTAMP(\"2020-02-13T00:00:00Z\") and TIMESTAMP(\"2020-02-15T00:00:00Z\")'"
Finally, I use the Dataflow web UI to select the template from GCS and launch a job. At the bottom of the page I specify my runtime parameters, setting queryString to the query I want to run.
Note: when I run the template in Dataflow, I specify queryString, and I know it is being passed in. I rewrote my first transform to print queryString, and it prints the value I gave at runtime. The problem is that the BigQuery read still uses the original query from when I built the template.
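To make the symptom concrete, here is a small library-free model of what I seem to be seeing. It is only a sketch: `DeferredQuerySketch`, `eagerRead`, and `deferredRead` are hypothetical names, not Beam classes, and whether BigQueryIO really behaves like the "eager" case is my assumption, not something I have confirmed. The first transform behaves like the deferred lookup, while the read behaves as if the query had been copied when the template was built.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model only; none of these names come from Apache Beam.
class DeferredQuerySketch {

    // Stand-in for the job parameters; overwritten when the template is launched.
    static final Map<String, String> params = new HashMap<>();

    // Eager: copies the query while the "template" is being built,
    // so later changes to params are never seen.
    static Supplier<String> eagerRead() {
        String captured = params.get("queryString");
        return () -> captured;
    }

    // Deferred: looks the query up only when the step actually executes.
    static Supplier<String> deferredRead() {
        return () -> params.get("queryString");
    }

    public static void main(String[] args) {
        // Template build time: a query has to be supplied on the command line.
        params.put("queryString", "SELECT build_time");
        Supplier<String> eager = eagerRead();
        Supplier<String> deferred = deferredRead();

        // Launch time: the runtime parameter replaces the build-time value.
        params.put("queryString", "SELECT run_time");

        System.out.println("eager:    " + eager.get());
        System.out.println("deferred: " + deferred.get());
    }
}
```

In this model the eager step keeps the build-time query and the deferred step sees the runtime one, which matches the split I observe between the BigQuery read and my first transform.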