
I have a number of text files with data that I want to import into a date-partitioned BigQuery table from a DataflowPipelineRunner running in batch mode. Instead of inserting into the partition of the current day at runtime, I want to insert into a partition based on a date mentioned in each row. (Unfortunately I can't use the bq command-line tool to import the text files directly, since I need to transform some of the values.)

I have tried outputting a timestamp from the ParDo function, windowing the collection into days, and then applying that window and outputting the table name suffixed by $ and the corresponding date:

BigQueryIO.Write.to(new SerializableFunction<BoundedWindow, String>() {
  @Override
  public String apply(BoundedWindow window) {
    // Format the window's start date as yyyyMMdd for the partition decorator.
    String dayString = DateTimeFormat.forPattern("yyyyMMdd")
                         .withZone(DateTimeZone.forID("Europe/Stockholm"))
                         .print(((IntervalWindow) window).start());
    return dataset + "$" + dayString;
  }
})
.withSchema(schema.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);

When I try to run this I hit a Dataflow bug. I also found out that

Per-window tables are not yet supported in batch mode.

So how can I write to a date-partitioned table with a specified date as the partition?

Comment: This is now available in both batch and streaming modes: stackoverflow.com/questions/43505534/… — jkff

1 Answer


If you have a relatively small, fixed number of tables you need to output to, you can create a separate BigQueryIO.Write transform for each table and partition your data by date. If the number of output tables is very large, there is currently no good solution until batch Dataflow supports per-window tables.
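A minimal sketch of the routing logic behind that approach, assuming the set of dates is small and known in advance (the class and names below, e.g. PartitionRouting and "dataset.events", are hypothetical): the same index function would back a Partition.of(...) transform splitting the PCollection into a PCollectionList, and each resulting partition would get its own BigQueryIO.Write pointed at the table$yyyyMMdd partition decorator with WRITE_APPEND.

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the per-day routing used by the "one Write per table"
// approach. All names here are illustrative, not from the Dataflow SDK.
public class PartitionRouting {

    // The fixed, known-in-advance list of day partitions to write to.
    static final List<String> DAYS =
        Arrays.asList("20160101", "20160102", "20160103");

    // PartitionFn-style logic: map a row's day string to a partition index.
    // In the pipeline this would back Partition.of(DAYS.size(), fn).
    static int partitionFor(String day) {
        int i = DAYS.indexOf(day);
        if (i < 0) {
            throw new IllegalArgumentException("Unexpected day: " + day);
        }
        return i;
    }

    // Build the BigQuery partition decorator for one day; passing
    // "dataset.table$yyyyMMdd" as the table spec targets that date partition.
    static String tableSpecFor(String table, String day) {
        return table + "$" + day;
    }

    public static void main(String[] args) {
        // Each partition index i would then get its own transform, roughly:
        //   partitions.get(i).apply(BigQueryIO.Write
        //       .to(tableSpecFor("dataset.events", DAYS.get(i)))
        //       .withWriteDisposition(WRITE_APPEND));
        for (String day : DAYS) {
            System.out.println(partitionFor(day) + " -> "
                + tableSpecFor("dataset.events", day));
        }
    }
}
```

The index function must be deterministic and total over the input, which is why unexpected dates fail fast here rather than being silently dropped.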