Autodetect BigQuery schema within Dataflow?

Question

Is it possible to use the equivalent of --autodetect in DataFlow?

i.e. can we load data into a BQ table without specifying a schema, equivalent to how we can load data from a CSV with --autodetect?

(potentially related question)

Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it’s hard to tell exactly what you're asking. See the How to Ask page for help clarifying this question. — Mikhail Berlyant

Matthias Baetens Matthias Baetens · Accepted Answer · 2017-05-16T20:44:08

If you are using protocol buffers as objects in your PCollections (which should be performing very well on the Dataflow back-end) you might be able to use a util I wrote in the past. It will parse the schema of the protobuffer into a BigQuery schema at runtime, based on inspection of the protobuffer descriptor.

I quickly uploaded it to GitHub, it's WIP, but you might be able to use it or be inspired to write something similar using Java Reflection (I might do it myself at some point).

You can use the util as follows:

TableSchema schema = ProtobufUtils.makeTableSchema(ProtobufClass.getDescriptor());
enhanced_events.apply(BigQueryIO.Write.to(tableToWrite).withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

where the create disposition will create the table with the schema specified and the ProtobufClass is the class generated using your Protobuf schema and the proto compiler.

Autodetect BigQuery schema within Dataflow?

2 Answers