1 vote

I'm writing to BigQuery in a Beam job from an unbounded source, using STREAMING_INSERTS as the method. I was looking at how to throttle the rows sent to BigQuery based on the recommendations in

https://cloud.google.com/bigquery/quotas#streaming_inserts

The BigQueryIO.Write API doesn't provide a way to set the micro-batch size.

I was looking at using triggers, but I'm not sure whether BigQueryIO groups everything in a pane into a single request. I've set up the trigger as below:

    Window.<Long>into(new GlobalWindows())
        .triggering(
            Repeatedly.forever(
                AfterFirst.of(
                    AfterPane.elementCountAtLeast(5),
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(2)))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.ZERO) // required once a non-default trigger is set

Q1. Does Beam support micro-batching, or does it create one request for each element in the PCollection?

Q2. Does the above trigger make sense? Even if I set the window/trigger, it could still be sending one request for every element.

This does streaming inserts. Why not use FILE_LOADS? – Pablo
My intent was to have at least 500 rows (as suggested by the documentation) or wait for a predefined time before submitting an insert request, so it can balance latency and throughput. My use case is to have the data as close to real time as possible. – PK109
After going through the source code I figured out that Beam's BigQuery sink does create small batches. github.com/apache/beam/blob/master/sdks/java/io/… shows that the finishBundle method tries to send multiple rows in one request. But this class is marked internal (package-private), so there is no way to see how many rows it batches or how it throttles under a varying input rate. It would be ideal to see some explanation of how it behaves; the sketch after these comments illustrates the general pattern. – PK109
Sounds good. Will try to get something. – Pablo
This might be a diversion, but the source github.com/apache/beam/blob/master/sdks/java/io/… shows that the writeAndGetErrors method applies a global window to the collection, similar to what I posted in the question. That would override whatever WindowFn had been applied before the write to BigQuery. What is the implication of this? – PK109
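To make the pattern described in the comments concrete, here is a minimal sketch of a DoFn that buffers rows during a bundle and flushes them as one request in finishBundle. This is not Beam's actual internal writer (that class is package-private); MAX_BATCH_SIZE and flush() are illustrative assumptions, not Beam's real values or API.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;
    import com.google.api.services.bigquery.model.TableRow;

    // Illustrative sketch only: buffers rows per bundle and flushes them in
    // batches, mirroring the pattern the comments describe in Beam's
    // internal streaming-insert writer.
    public class BatchingInsertFn extends DoFn<TableRow, Void> {
      private static final int MAX_BATCH_SIZE = 500; // assumed, per the quota guidance
      private transient List<TableRow> buffer;

      @StartBundle
      public void startBundle() {
        buffer = new ArrayList<>();
      }

      @ProcessElement
      public void processElement(@Element TableRow row) {
        buffer.add(row);
        if (buffer.size() >= MAX_BATCH_SIZE) {
          flush(); // one insert request for the whole batch
        }
      }

      @FinishBundle
      public void finishBundle() {
        if (!buffer.isEmpty()) {
          flush(); // remaining rows go out as a final, smaller batch
        }
      }

      private void flush() {
        // The real writer issues a single tabledata.insertAll request
        // containing every buffered row; left abstract here.
        buffer.clear();
      }
    }

Note that batching per bundle means the batch size also depends on how the runner sizes bundles, which varies with the input rate.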

1 Answer

0 votes

I don't know what you mean by micro-batch. The way I see it, BigQuery supports loading data either in batches or via streaming.

Basically, batch loads are subject to load-job quotas, and streaming inserts are a bit more expensive.

Once you set the insertion method for your BigQueryIO, the documentation states:

Note: If you use batch loads in a streaming pipeline, you must use withTriggeringFrequency to specify a triggering frequency.

Never tried it myself, but withTriggeringFrequency seems to be what you need here.
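Untested, but wiring that up might look roughly like this; the table name, schema, and shard count are placeholder assumptions:

    rows.apply(
        BigQueryIO.writeTableRows()
            .to("project:dataset.table") // placeholder table
            .withSchema(schema)          // placeholder schema
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Kick off a load job for the accumulated rows every 2 minutes.
            .withTriggeringFrequency(Duration.standardMinutes(2))
            // Required when withTriggeringFrequency is used with FILE_LOADS.
            .withNumFileShards(1)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

The triggering frequency here plays the role your window/trigger was meant to play: it controls how often accumulated rows are pushed to BigQuery, trading latency for fewer, larger requests.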