1 vote

I can't seem to find any documentation about this. I have an apache-beam pipeline that takes some information, formats it into TableRows and then writes to BigQuery.

The problem:

The rows are not written to BigQuery until the Dataflow job finishes. If I have a Dataflow job that takes a long time, I'd like to be able to see the rows being inserted into BigQuery as it runs. Can anybody point me in the right direction?

Thanks in advance

2
The behavior of the pipeline is determined by many factors, and it is hard to answer specifically without more details. For example, what is the source of the data? Could the source be buffering all the data? Is the source bounded or unbounded? What are the windowing settings of your pipeline? What are the triggering settings? Did you try other settings for windowing/triggering, or other sources/sinks? Can you share the code you have? – Anton
I'm sorry Anton, unfortunately I cannot share the code because of work policy, but my source is bounded. I am doing a batch job with no windowing. I have only used the BigQueryIO.writeTableRows() sink. – user9773014
As for when data gets where: I have a PTransform that takes a while to do its thing but outputs data very quickly. That data goes through another PTransform that converts it to TableRows and then finally sends it to the Write transform. – user9773014
It sounds impossible in your setup in batch mode. One of the workarounds I can think of is to switch to streaming mode (--streaming, or the corresponding PipelineOptions flag for the Dataflow runner) and add triggering (e.g. GlobalWindows plus a trigger that fires every xx seconds; sketched below). – Anton
Just an aside, this will incur extra costs since you would be streaming into a table (cloud.google.com/bigquery/pricing#streaming_pricing). – CasualT
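For illustration, here is a minimal sketch of the workaround Anton describes, assuming a Beam Java pipeline run with --streaming; the 30-second interval, the helper name, and the transform label are placeholders, not values from the question:

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;
    import com.google.api.services.bigquery.model.TableRow;

    // Re-window the TableRows into the global window with a processing-time
    // trigger, so panes are emitted (and written) periodically instead of
    // once at the end of the job.
    static PCollection<TableRow> triggerPeriodically(PCollection<TableRow> rows) {
      return rows.apply("TriggerEvery30s",
          Window.<TableRow>into(new GlobalWindows())
              .triggering(Repeatedly.forever(
                  AfterProcessingTime.pastFirstElementInPane()
                      .plusDelayOf(Duration.standardSeconds(30)))) // the "xx seconds"
              .withAllowedLateness(Duration.ZERO)
              .discardingFiredPanes());
    }

The re-windowed collection can then be passed to BigQueryIO.writeTableRows() as before.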

2 Answers

1 vote

Since you're working in batch mode, all the data needs to be written at the same time to the same table. If you're working with partitions, all data belonging to a partition needs to be written at the same time. That's why the insert happens only at the end of the job.

Please note that the WriteDisposition is very important when you work in batches, because you either append the data or truncate the table. But does this distinction make sense for a streaming pipeline?

In Java, you can specify the insert method with the following call:

.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)

I've not tested it, but I believe it should work as expected. Also note that streaming inserts to BigQuery are not free of charge.
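For context, a fuller sketch of what that write might look like, assuming a PCollection<TableRow> named rows; the table reference and the name/count fields are hypothetical placeholders:

    import java.util.Arrays;
    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    // Hypothetical schema matching the TableRows being produced.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("name").setType("STRING"),
        new TableFieldSchema().setName("count").setType("INTEGER")));

    rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // placeholder table
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));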

0 votes

Depending on how complex your initial transform+load operation is, you could just use the BigQuery client driver to do streaming inserts into the table from your own worker pool, rather than loading the data in via a Dataflow job explicitly.
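For example, a minimal sketch of a streaming insert using the google-cloud-bigquery Java client's insertAll; the dataset, table, and field names are placeholders:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.HashMap;
    import java.util.Map;

    public class StreamingInsertExample {
      public static void main(String[] args) {
        // Uses application-default credentials.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table"); // placeholder

        Map<String, Object> row = new HashMap<>();
        row.put("name", "example"); // placeholder fields
        row.put("count", 42L);

        // Streaming insert: the row becomes queryable within seconds,
        // without waiting for a batch load job to finish.
        InsertAllResponse response = bigquery.insertAll(
            InsertAllRequest.newBuilder(tableId).addRow(row).build());

        if (response.hasErrors()) {
          response.getInsertErrors().forEach((index, errors) ->
              System.err.println("Row " + index + " failed: " + errors));
        }
      }
    }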

Alternatively, you could do smaller batches:

  • N independent jobs, each loading TIME_PERIOD/N amounts of data