I have a Java Dataflow pipeline with the following parts (a minimal sketch follows the list):
- PubSub subscriber reading from several topics
- Flatten.pCollections operation
- Transform from PubsubMessage to TableRow
- BigQuery writer that writes everything to dynamically determined tables
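Roughly, the pipeline is shaped like this. The subscription names, the PubsubMessage-to-TableRow DoFn, and the table function are simplified placeholders for the real logic:

```java
import com.google.api.services.bigquery.model.TableRow;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionList;

public class MultiTopicPipeline {

  // Placeholder transform: the real payload-to-TableRow mapping is more involved.
  static class ToTableRowFn extends DoFn<PubsubMessage, TableRow> {
    @ProcessElement
    public void process(ProcessContext c) {
      String payload = new String(c.element().getPayload(), StandardCharsets.UTF_8);
      c.output(new TableRow().set("payload", payload));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder subscription names; the real list is configurable.
    List<String> subscriptions = Arrays.asList(
        "projects/my-project/subscriptions/sub-a",
        "projects/my-project/subscriptions/sub-b");

    // One PubsubIO source per subscription.
    PCollectionList<PubsubMessage> sources = PCollectionList.empty(p);
    for (String sub : subscriptions) {
      sources = sources.and(
          p.apply("Read " + sub,
              PubsubIO.readMessagesWithAttributes().fromSubscription(sub)));
    }

    // Merge the sources, convert, and write to a dynamically chosen table.
    sources
        .apply(Flatten.pCollections())
        .apply("ToTableRow", ParDo.of(new ToTableRowFn()))
        .apply(BigQueryIO.writeTableRows()
            // Placeholder table function; the real one picks the table per row.
            .to(row -> new TableDestination("my-project:my_dataset.events", null))
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```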
When there's more than one PubSub topic in the list of subscriptions to connect to, all elements get stuck in the GroupByKey inside the Reshuffle step of the BigQuery writer. I've let it run for a couple of hours after sending several dozen test messages, but nothing is written to BigQuery.
I found the following three workarounds (each of them works independently of the others):
- Add a `withTimestampAttribute` call on the PubSub subscriptions (see the snippet after this list). The name of the attribute does not matter at all; it can be any existing or non-existing attribute on the incoming messages
- Reduce the number of PubSub subscriptions to just 1
- Remove the Flatten.pCollections operation in between, creating multiple separate pipelines doing the exact same thing
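Continuing the sketch above, the first workaround just adds one call to each read. Here `"ts"` is an arbitrary name and, as noted, it does not need to exist as an attribute on the messages:

```java
// Workaround 1: declare a timestamp attribute on each subscription read.
// The attribute name is arbitrary ("ts" here); messages without it appear
// to fall back to the PubSub publish timestamp.
sources = sources.and(
    p.apply("Read " + sub,
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription(sub)
            .withTimestampAttribute("ts")));
```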
The messages are not intentionally timestamped; writing them to BigQuery using just the PubsubMessage timestamp is completely acceptable.
It also confuses me that even adding a non-existing timestamp attribute seems to fix the issue. I added a debug step to print the element timestamps within the pipeline (roughly the pass-through DoFn below), and they are comparable in both cases; when a non-existing timestamp attribute is specified, it seems to fall back to the PubSub publish timestamp anyway.
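The debug output came from a pass-through step along these lines, inserted after the Flatten (illustrative, not the exact code):

```java
// Pass-through DoFn that prints each element's Beam timestamp, i.e. the
// timestamp the watermark machinery tracks for that element.
static class LogTimestampFn extends DoFn<PubsubMessage, PubsubMessage> {
  @ProcessElement
  public void process(ProcessContext c) {
    System.out.println("element timestamp: " + c.timestamp());
    c.output(c.element());
  }
}
```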
What could be causing this issue, and how can I resolve it? For me the most acceptable workaround is removing the Flatten.pCollections operation, as it doesn't really complicate the code, but I can't get my head around why it fails.