4 votes

I have the following scenario:

  1. Pipeline A looks up table A in BigQuery, does some computation and returns a list of column names.
  2. This list of column names is used as the BigQuery schema for the output of Pipeline B.

Can you please let me know the best option to achieve this?

Could Pipeline A use TextIO to write the list of column names to files in a temporary or staging location, which the pipeline executor then reads to define the schema for Pipeline B? If this approach looks fine, can you please let me know whether there is a Dataflow utility for reading files from the temporary or staging location, or whether the GCS API should be used directly?
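
For concreteness, here is a rough sketch of what I have in mind for the Pipeline A side (Dataflow Java SDK; the bucket path and the hard-coded column names are placeholders for my real lookup and computation):

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Create;

    public class WriteColumnNames {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipelineA = Pipeline.create(options);

        pipelineA
            // Placeholder for the real BigQuery lookup and computation on table A.
            .apply(Create.of("col_a", "col_b", "col_c"))
            .setCoder(StringUtf8Coder.of())
            // Write the column names, one per line, to a single file.
            .apply(TextIO.Write.to("gs://my-bucket/staging/columns.txt")
                .withoutSharding());

        pipelineA.run();
      }
    }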


2 Answers

5 votes

You would need to do the following (a sketch of the whole flow follows the list):

  1. Construct Pipeline A to write to some location such as GCS (any durable location that you can reference when constructing Pipeline B will work).
  2. Use the BlockingDataflowPipelineRunner to run Pipeline A and wait until it is done.
  3. Construct Pipeline B, using the schema information read from the location you chose in step 1.
  4. Run Pipeline B.
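
Putting the four steps together, a minimal sketch could look like this (Dataflow Java SDK 1.x; the bucket path, table spec, hard-coded column names, and the one-column-name-per-line file format are all assumptions, and ReadSchemaFromGcs.readColumnNames() is the hypothetical GcsUtil-based helper sketched further down):

    import java.util.ArrayList;
    import java.util.List;

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
    import com.google.cloud.dataflow.sdk.coders.TableRowJsonCoder;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;
    import com.google.cloud.dataflow.sdk.transforms.Create;

    public class SchemaHandoff {
      public static void main(String[] args) throws Exception {
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation().as(DataflowPipelineOptions.class);
        // Step 2: the blocking runner makes run() wait for job completion, so
        // the schema file is guaranteed to exist before Pipeline B is built.
        options.setRunner(BlockingDataflowPipelineRunner.class);

        String schemaFile = "gs://my-bucket/staging/columns.txt";

        // Step 1: Pipeline A computes the column names and writes them to GCS.
        buildPipelineA(options, schemaFile).run(); // blocks until A is done

        // Step 3: read the names back (one per line) and build the schema.
        // readColumnNames() is the hypothetical helper sketched below.
        List<TableFieldSchema> fields = new ArrayList<>();
        for (String column : ReadSchemaFromGcs.readColumnNames(options, schemaFile)) {
          fields.add(new TableFieldSchema().setName(column).setType("STRING"));
        }
        TableSchema schema = new TableSchema().setFields(fields);

        // Step 4: Pipeline B writes its output with the schema from Pipeline A.
        Pipeline pipelineB = Pipeline.create(options);
        pipelineB
            .apply(Create.of(new TableRow().set("col_a", "hello"))) // placeholder source
            .setCoder(TableRowJsonCoder.of())
            .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table")
                .withSchema(schema)
                .withCreateDisposition(
                    BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
        pipelineB.run();
      }

      // Stand-in for the real Pipeline A: the column names are hard-coded here
      // instead of being computed from table A.
      static Pipeline buildPipelineA(DataflowPipelineOptions options, String schemaFile) {
        Pipeline p = Pipeline.create(options);
        p.apply(Create.of("col_a", "col_b", "col_c"))
            .setCoder(StringUtf8Coder.of())
            .apply(TextIO.Write.to(schemaFile).withoutSharding());
        return p;
      }
    }

The key point of the design is that nothing schema-dependent in Pipeline B is constructed until pipelineA.run() has returned, which the blocking runner guarantees.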

I would not use the temporary location, because we may clean it up before you get around to constructing Pipeline B. The staging location (if different from the temporary location) can be used. I would also advise using a unique file name, so that Pipeline B doesn't read in stale results if Pipeline A runs multiple times.
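
For example, salting the path with java.util.UUID gives every run of Pipeline A its own location (the path layout here is just an illustration):

    // A unique per-run location means Pipeline B can never pick up stale
    // output left behind by an earlier run of Pipeline A.
    String schemaFile = "gs://my-bucket/staging/schema-"
        + java.util.UUID.randomUUID() + "/columns.txt";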

This should help you read from and write to GCS: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/GcsUtil.java

You can get an instance of GcsUtil from the PipelineOptions object: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/options/GcsOptions.java#L43
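
As a concrete illustration, a helper along the following lines could read the schema file back before Pipeline B is constructed (my own sketch rather than SDK code; it assumes one column name per line, and it is the readColumnNames() used in the handoff sketch above):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.channels.Channels;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import com.google.cloud.dataflow.sdk.options.GcsOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;
    import com.google.cloud.dataflow.sdk.util.GcsUtil;
    import com.google.cloud.dataflow.sdk.util.gcsfs.GcsPath;

    public class ReadSchemaFromGcs {
      public static List<String> readColumnNames(PipelineOptions options, String uri)
          throws IOException {
        // GcsUtil comes from the options object, as linked above.
        GcsUtil gcsUtil = options.as(GcsOptions.class).getGcsUtil();
        List<String> columns = new ArrayList<>();
        // open() returns a SeekableByteChannel for the GCS object.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            Channels.newInputStream(gcsUtil.open(GcsPath.fromUri(uri))),
            StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {
            columns.add(line);
          }
        }
        return columns;
      }
    }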

1 vote

This is possible with the latest version of Apache Beam. See my more general, self-answered question: Writing different values to different BigQuery tables in Apache Beam.
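
For completeness, here is a minimal sketch of the single-pipeline variant this enables, using BigQueryIO's withSchemaFromView() so that a schema computed earlier in the same pipeline can be supplied as a side input (the table spec and the single hard-coded field are placeholders):

    import java.util.Map;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollectionView;

    import com.google.api.services.bigquery.model.TableRow;

    public class DynamicSchema {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Map of table spec -> JSON-serialized TableSchema, computed inside
        // the pipeline (hard-coded here as a stand-in for the real logic).
        PCollectionView<Map<String, String>> schemaView = p
            .apply("SchemaEntry", Create.of(
                KV.of("my-project:my_dataset.my_table",
                    "{\"fields\":[{\"name\":\"col_a\",\"type\":\"STRING\"}]}"))
                .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())))
            .apply(View.<String, String>asMap());

        p.apply("Rows", Create.of(new TableRow().set("col_a", "hello"))
                .withCoder(TableRowJsonCoder.of()))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchemaFromView(schemaView)
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));

        p.run();
      }
    }

Because the schema is just another PCollection turned into a view, the two-pipeline handoff described in the accepted answer is no longer necessary in this case.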