
We have a large table in BigQuery where the data is streaming in. Each night, we want to run a Cloud Dataflow pipeline that processes the last 24 hours of data.

In BigQuery, it's possible to do this using a 'Table Decorator' and specifying the range we want, i.e. the last 24 hours.
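For reference, the kind of decorator we mean is the legacy SQL relative range form, where the offset is given in milliseconds (86400000 ms covers the last 24 hours), roughly:

    SELECT * FROM [projectid:datasetid.events@-86400000-]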

Is the same functionality somehow possible in Dataflow when reading from a BQ table?

We've had a look at the 'Windows' documentation for Dataflow, but we can't quite figure out if that's what we need. We came up with this so far (we want the last 24 hours of data using FixedWindows), but it still tries to read the whole table:

pipeline.apply(BigQueryIO.Read
                .named("events-read-from-BQ")
                .from("projectid:datasetid.events"))
                .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardHours(24))))
                .apply(ParDo.of(denormalizationParDo)
                        .named("events-denormalize")
                        .withSideInputs(getSideInputs()))
                .apply(BigQueryIO.Write
                        .named("events-write-to-BQ")
                        .to("projectid:datasetid.events")
                        .withSchema(getBigQueryTableSchema())
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

Are we on the right track?


1 Answer


Thank you for your question.

At this time, BigQueryIO.Read expects table information in the "project:dataset.table" format, so specifying decorators would not work.
Until support for this is in place, you can try the following approaches:

  • Run a batch stage which extracts the whole BigQuery table, filters out the unnecessary data, and processes what remains (see the filter sketch after this list). If the table is really big, you may want to fork the filtered data into a separate table if the amount of data read is significantly smaller than the total amount of data.
  • Use streaming Dataflow. For example, you could publish the data to Pub/Sub and create a streaming pipeline with a 24-hour window (see the streaming sketch after this list). The streaming pipeline runs continuously, but provides sliding windows rather than strict daily windows.
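For the first approach, here is a minimal sketch of the filtering step, reusing the read from your question. It assumes your rows carry an epoch-millisecond timestamp column; the "event_timestamp" field name below is just a placeholder for whatever your schema actually uses:

    PCollection<TableRow> last24Hours = pipeline
            .apply(BigQueryIO.Read
                    .named("events-read-from-BQ")
                    .from("projectid:datasetid.events"))
            .apply(ParDo.named("events-filter-last-24h").of(new DoFn<TableRow, TableRow>() {
                @Override
                public void processElement(ProcessContext c) {
                    TableRow row = c.element();
                    // "event_timestamp" is a placeholder for your actual timestamp column (epoch millis).
                    long eventMillis = Long.parseLong(String.valueOf(row.get("event_timestamp")));
                    long cutoffMillis = System.currentTimeMillis() - Duration.standardHours(24).getMillis();
                    if (eventMillis >= cutoffMillis) {
                        c.output(row);
                    }
                }
            }));
    // ...then apply your existing denormalization and BigQueryIO.Write steps to last24Hours.

Note that this still reads (and is billed for) the full table; it only reduces what the rest of the pipeline has to process.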
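For the second approach, here is a minimal sketch of the streaming front end, assuming you also publish each event to a Pub/Sub topic. The topic name, the ParseEventFn helper (your own String-to-TableRow parser), and the one-hour slide period are all placeholders you would adapt:

    PCollection<TableRow> windowedEvents = pipeline
            .apply(PubsubIO.Read
                    .named("events-read-from-pubsub")
                    .topic("projects/projectid/topics/events"))      // hypothetical topic
            .apply(ParDo.named("events-parse").of(new ParseEventFn()))  // hypothetical String -> TableRow parser
            .apply(Window.<TableRow>into(
                    SlidingWindows.of(Duration.standardHours(24))
                            .every(Duration.standardHours(1))));
    // ...then apply the denormalization and BigQueryIO.Write steps. Note that WRITE_TRUNCATE
    // is generally not applicable here, since streaming writes use streaming inserts.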

Hope this helps