Using DataFlow SDK 2.x, I would like to read data from a partitioned BigQuery table.

The only option I have found so far is to use the BigQueryIO.Read.fromQuery(String query) method with a query like SELECT * FROM table WHERE _PARTITIONTIME = 'yyyy-MM-dd'.
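
For reference, here is roughly what that looks like (a minimal sketch only; the project, dataset, table and date are placeholders, and depending on the exact 2.x version the entry point may be BigQueryIO.read() rather than BigQueryIO.readTableRows()):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadOnePartition {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Query a single ingestion-time partition of the table.
        // Project, dataset, table and date below are placeholders.
        PCollection<TableRow> rows = p.apply("ReadOnePartition",
            BigQueryIO.readTableRows()
                .fromQuery(
                    "SELECT * FROM `my-project.my_dataset.my_table` "
                        + "WHERE _PARTITIONTIME = TIMESTAMP('2017-01-01')")
                .usingStandardSql());

        p.run();
      }
    }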

Is there any other way to read from a partitioned BigQuery table in DataFlow?

This is the correct way. If you use a normal read, it will read the entire table into your pipeline, regardless of whether it is a partitioned table or not. – Graham Polley
@GrahamPolley, thanks for the answer. Do you plan to add method-level support for querying partitioned tables? – bleare
You can also use a partition decorator; reading from "table$yyyyMMdd" should work, I believe. There are currently no plans for adding more explicit support as far as I'm aware; if you need it, feel free to file a JIRA at issues.apache.org/jira/browse/BEAM or post on the user@ mailing list. – jkff
@jkff Is this behavior documented somewhere? – pradithya aria
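
Following the partition decorator suggestion in the comments, the read would presumably look like this (an untested sketch; project, dataset, table and date are placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadWithPartitionDecorator {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // "$20170101" is a partition decorator: it restricts the read to that
        // single ingestion-time partition without running a query.
        PCollection<TableRow> rows = p.apply("ReadDecorator",
            BigQueryIO.readTableRows()
                .from("my-project:my_dataset.my_table$20170101"));

        p.run();
      }
    }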

1 Answer

Well, to be honest, there is a better way to do it: two options that help with performance, and one more if you look into the BigQuery API.

So recall that

    Select * from `tableName` 

is an expensive operation for your database. To speed up the read time, you could instead run:

    Select <field1>, <field2>, ... from `tableName`

Of course, listing all the field names is not practical in many cases, but it gives you roughly twice the speed-up.
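
To make that concrete, here is a minimal sketch that combines an explicit field list with a partition filter (assuming the Beam/Dataflow 2.x Java SDK; field, table and date names are placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadSelectedFields {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Name only the columns you need and keep the partition filter, so
        // BigQuery scans just those columns in just that partition.
        PCollection<TableRow> rows = p.apply("ReadSelectedFields",
            BigQueryIO.readTableRows()
                .fromQuery(
                    "SELECT field1, field2 FROM `my-project.my_dataset.my_table` "
                        + "WHERE _PARTITIONTIME = TIMESTAMP('2017-01-01')")
                .usingStandardSql());

        p.run();
      }
    }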

But instead of doing that, you can follow the example here. That also gives you roughly twice the speed-up and is much more convenient.

Let me know if you need a coded example of the approach in that link. Cheers!