
My streaming Dataflow pipeline, which pulls data from PubSub, doesn't write anything to BigQuery and logs no errors. The elements go into the node "Write to BigQuery/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey", which is created implicitly like this:

PCollection<TableRow> rows = ...;
rows.apply("Write to BigQuery",
    BigQueryIO.writeTableRows().to(poptions.getOutputTableName())
        .withSchema(...)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
        .withExtendedErrorInfo());

But the elements never leave it, or at least not within the system lag, which is now 45 minutes. This is supposed to be a streaming job, so how can I make it flush and write the data? This is Beam version 2.13.0. Thank you.

UPDATE: Stackdriver log (no errors) for the step that writes data to BigQuery:

[screenshot: Stackdriver log for the write step]

I can also add that this works if I use the DirectRunner in the cloud (but only for small amounts of data), and with either runner if I insert row by row using the Java interface to BigQuery (but that is at least two orders of magnitude too slow to begin with).

Try running it with the DirectRunner from your local machine. It will log more verbose error messages. In cases like this, that has sometimes helped me. – jszule
Unfortunately not, as that works fine with no errors: it writes to the table. There is one other difference: when deployed, a PubSub read node is added at the start of the pipeline, which I cannot easily reproduce in a local run (it results in large amounts of data); otherwise it is the same. When I deploy it in the cloud it doesn't work, again without any error messages. – nsandersen
Did you create the dataset you would like to write to? And do you have the correct permissions to write there? – jszule
Yes, via Deployment Manager. I can write from the pipeline running on my local computer, and we have other Dataflow jobs deployed through Cloud Build which can write. I don't think it gets as far as actually trying to write; I think it is still trying to order/organise the data to do so efficiently (but obviously failing so far). – nsandersen
@nsandersen Did you solve it? I've seen similar issues when running a backfill and therefore trying to stream TableRows into a table that is partitioned by ingestion time, where the rows belong to a partition older than 31 days. – nDakota

1 Answer


You might try changing your retry policy to InsertRetryPolicy.retryTransientErrors(). The alwaysRetry() policy will make the pipeline appear to stop making progress if there is a configuration error, for example the BigQuery table not existing or the job not having permission to access it: such failures are retried forever, so they are never reported as errors.
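As a rough sketch, keeping the write configuration from your question (poptions and the schema elided there as well), the change would look something like this. Since withExtendedErrorInfo() is already set, the rejected rows should also be retrievable from the WriteResult, so permanent failures become visible instead of being retried silently:

WriteResult result = rows.apply("Write to BigQuery",
    BigQueryIO.writeTableRows().to(poptions.getOutputTableName())
        .withSchema(...)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        // Retry only transient errors (e.g. rate limits, backend timeouts);
        // permanent problems such as a missing table or denied access
        // fail instead of being retried forever.
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .withExtendedErrorInfo());

// Optional: inspect the rows BigQuery rejected, together with the error details.
PCollection<BigQueryInsertError> failedRows = result.getFailedInsertsWithErr();

With this policy, a misconfigured table or a missing permission shows up as a failed insert rather than as a step that never makes progress.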

You can also check the worker logs in Stackdriver Logging. Do this by clicking on the "Stackdriver" link in the upper corner of the step log pane. Full directions are in the Dataflow logging documentation.