Goal

My goal is to create a Dataflow template that specifies an Apache Beam pipeline. The pipeline runs in batch mode, reads from BigQuery, performs transforms, and writes the results elsewhere. Most importantly, the query used to read from BigQuery must be supplied at runtime.

Expected Behavior

I expect the pipeline to use the runtime parameter as the BigQuery query, execute that query, and then proceed with the rest of the pipeline.

Actual Behavior

The runtime parameter I pass in is ignored; instead, the value I had to specify when creating the GCS template is used.

Relevant Code

Below is how I specify the read operation, and how the query parameter is defined and passed in.

public interface MyOptions extends PipelineOptions, StreamingOptions {
    @Description("Query String")
    ValueProvider<String> getQueryString();

    void setQueryString(ValueProvider<String> value);
}

public static void main(String[] args) {
        MyOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(MyOptions.class);
        Pipeline p = Pipeline.create(options);

        PCollection<TableRow> tableRows =
                p.apply(BigQueryIO.readTableRows()
                        .fromQuery(options.getQueryString())
                        .withTemplateCompatibility()
                        .withoutValidation());
        // At this point I run my transformations and loading
}

To build the template and push it to GCS, I run the following:

mvn compile -Pdataflow-runner exec:java -Dexec.mainClass=com.Pipeline "-Dexec.args=--runner=DataflowRunner --queryString='SELECT time,type FROM [my-project:timeseries.my-data] where time between TIMESTAMP(\"2020-02-13T00:00:00Z\") and TIMESTAMP(\"2020-02-15T00:00:00Z\")'"

Finally, I use the Dataflow Web UI to pick the template from GCS and deploy it. At the bottom of the Web UI I specify my runtime parameters, setting queryString to the runtime query I want to use.

Note: when I run the template in Dataflow, I specify queryString, and I know it is being passed in: I rewrote my first transform to print queryString, and it correctly prints the runtime value. The problem is that the "read from BigQuery" step still uses the original queryString from when I built the template.

1 Answer

After many iterations, I figured out the problem. There were actually two issues, the bigger one being that I should not have passed the runtime parameter in the "build template" step at all.

  1. Do not pass the runtime parameter when building the template. It seems obvious, but drop --queryString from the -Dexec.args of the mvn compile command.
  2. Formatting the queryString as a runtime parameter was difficult. The following worked for me after many iterations:
SELECT time,type FROM `my-project.timeseries.my-data` where time between TIMESTAMP(\"2019-02-13T00:00:00Z\") and TIMESTAMP(\"2020-02-15T00:00:00Z\")

Note the lack of quotes around the entire parameter and how the projectId.dataset.tableId was formatted.
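
One way to picture why point 1 matters is the toy model below. It is only a sketch in plain Java and does not use Beam's classes: the names DeferredOptionSketch, Deferred, fixedAtBuild and resolvedAtRun are hypothetical, and the assumption (my reading of how ValueProvider options behave, not something confirmed above) is that a value supplied at build time is captured as a fixed value, while an option left unset is looked up by name when the job runs.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical names, NOT Beam's ValueProvider implementation.
final class DeferredOptionSketch {

    /** A value that may only become available when the job runs. */
    interface Deferred<T> {
        T get();
    }

    /** Models an option passed at template build time: the value is captured once. */
    static <T> Deferred<T> fixedAtBuild(T value) {
        return () -> value;
    }

    /** Models an option left unset at build time: looked up by name at run time. */
    static Deferred<String> resolvedAtRun(String name, Map<String, String> runtimeOptions) {
        return () -> runtimeOptions.get(name);
    }

    public static void main(String[] args) {
        // Stands in for the parameters entered when launching the template.
        Map<String, String> runtimeOptions = new HashMap<>();

        // "Build template" step: one option is given a value, the other is not.
        Deferred<String> baked = fixedAtBuild("build-time query");
        Deferred<String> deferred = resolvedAtRun("queryString", runtimeOptions);

        // "Run template" step: the runtime parameter is supplied now.
        runtimeOptions.put("queryString", "runtime query");

        System.out.println(baked.get());    // still the build-time value
        System.out.println(deferred.get()); // the value supplied at run time
    }
}
```

In this model, the only way for the read step to see the runtime query is to leave the option unset when the template is built, which matches the behavior described above.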