1 vote

I'm writing to BigQuery in a Beam job from an unbounded source, using STREAMING_INSERTS as the method. I was looking at how to throttle the rows sent to BigQuery based on the recommendations in

https://cloud.google.com/bigquery/quotas#streaming_inserts

The BigQueryIO.Write API doesn't provide a way to set the micro-batch size.

I was looking at using triggers, but I'm not sure whether BigQueryIO groups everything in a pane into a single request. I've set up the trigger as below:

    Window.<Long>into(new GlobalWindows())
        .triggering(
            Repeatedly.forever(
                AfterFirst.of(
                    AfterPane.elementCountAtLeast(5),
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(2)))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.ZERO) // required once a non-default trigger is set

Q1. Does Beam support micro-batching, or does it create one request for each element in the PCollection?

Q2. Does the above trigger make sense? Even if I set the window/trigger, it could still be sending one request for every element.

This does streaming inserts. Why not use FILE_LOADS? – Pablo
My intent was to have at least 500 rows (as suggested by the documentation) or wait for a predefined time before submitting an insert request, so it can balance latency and throughput. My use case is to have the data as close to real time as possible. – PK109
After going through the source code I figured out that Beam's BigQuery sink does create small batches. github.com/apache/beam/blob/master/sdks/java/io/… shows that the finishBundle method tries to send multiple rows in one request. But this class is marked internal (package-private), so there is no way to see how many rows it batches or how it throttles under a varying input rate. It would be ideal to see some explanation of how it behaves; the sketch after these comments illustrates the general pattern. – PK109
Sounds good. Will try to get something. – Pablo
This might be a diversion, but the source github.com/apache/beam/blob/master/sdks/java/io/… shows that the writeAndGetErrors method applies a global window to the collection, similar to what I posted in the question. That would override whatever WindowFn had been applied before the write to BigQuery. What is the implication of this? – PK109
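To make the pattern described in the comments concrete, here is a minimal sketch of a DoFn that buffers rows during a bundle and flushes them as one request in finishBundle. This is not Beam's actual internal writer (that class is package-private); MAX_BATCH_SIZE and flush() are illustrative assumptions, not Beam's real values or API.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;
    import com.google.api.services.bigquery.model.TableRow;

    // Illustrative sketch only: buffers rows per bundle and flushes them in
    // batches, mirroring the pattern the comments describe in Beam's
    // internal streaming-insert writer.
    public class BatchingInsertFn extends DoFn<TableRow, Void> {
      private static final int MAX_BATCH_SIZE = 500; // assumed, per the quota guidance
      private transient List<TableRow> buffer;

      @StartBundle
      public void startBundle() {
        buffer = new ArrayList<>();
      }

      @ProcessElement
      public void processElement(@Element TableRow row) {
        buffer.add(row);
        if (buffer.size() >= MAX_BATCH_SIZE) {
          flush(); // one insert request for the whole batch
        }
      }

      @FinishBundle
      public void finishBundle() {
        if (!buffer.isEmpty()) {
          flush(); // remaining rows go out as a final, smaller batch
        }
      }

      private void flush() {
        // The real writer issues a single tabledata.insertAll request
        // containing every buffered row; left abstract here.
        buffer.clear();
      }
    }

Note that batching per bundle means the batch size also depends on how the runner sizes bundles, which varies with the input rate.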

1 Answer

0 votes

I don't know what you mean by micro-batch. The way I see it, BigQuery supports loading data either in batches or via streaming.

Basically, batch loads are subject to load-job quotas, and streaming inserts are a bit more expensive.

Once you set the insertion method for your BigQueryIO, the documentation states:

Note: If you use batch loads in a streaming pipeline, you must use withTriggeringFrequency to specify a triggering frequency.

Never tried it myself, but withTriggeringFrequency seems to be what you need here.
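Untested, but wiring that up might look roughly like this; the table name, schema, and shard count are placeholder assumptions:

    rows.apply(
        BigQueryIO.writeTableRows()
            .to("project:dataset.table") // placeholder table
            .withSchema(schema)          // placeholder schema
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Kick off a load job for the accumulated rows every 2 minutes.
            .withTriggeringFrequency(Duration.standardMinutes(2))
            // Required when withTriggeringFrequency is used with FILE_LOADS.
            .withNumFileShards(1)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

The triggering frequency here plays the role your window/trigger was meant to play: it controls how often accumulated rows are pushed to BigQuery, trading latency for fewer, larger requests.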