I have a Java Dataflow pipeline with the following parts (a minimal sketch follows the list):
- PubSub subscriber reading from several topics
- Flatten.pCollections operation
- Transform from PubsubMessage to TableRow
- BigQuery writer that writes everything to dynamically determined tables
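Roughly, the pipeline is shaped like this. The subscription names, the PubsubMessage-to-TableRow DoFn, and the table function are simplified placeholders for the real logic:

```java
import com.google.api.services.bigquery.model.TableRow;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionList;

public class MultiTopicPipeline {

  // Placeholder transform: the real payload-to-TableRow mapping is more involved.
  static class ToTableRowFn extends DoFn<PubsubMessage, TableRow> {
    @ProcessElement
    public void process(ProcessContext c) {
      String payload = new String(c.element().getPayload(), StandardCharsets.UTF_8);
      c.output(new TableRow().set("payload", payload));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder subscription names; the real list is configurable.
    List<String> subscriptions = Arrays.asList(
        "projects/my-project/subscriptions/sub-a",
        "projects/my-project/subscriptions/sub-b");

    // One PubsubIO source per subscription.
    PCollectionList<PubsubMessage> sources = PCollectionList.empty(p);
    for (String sub : subscriptions) {
      sources = sources.and(
          p.apply("Read " + sub,
              PubsubIO.readMessagesWithAttributes().fromSubscription(sub)));
    }

    // Merge the sources, convert, and write to a dynamically chosen table.
    sources
        .apply(Flatten.pCollections())
        .apply("ToTableRow", ParDo.of(new ToTableRowFn()))
        .apply(BigQueryIO.writeTableRows()
            // Placeholder table function; the real one picks the table per row.
            .to(row -> new TableDestination("my-project:my_dataset.events", null))
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```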
When there's more than one PubSub topic in the list of subscriptions to connect to, all elements get stuck in the GroupByKey inside the Reshuffle step of the BigQuery writer. I've let it run for a couple of hours after sending several dozen test messages, but nothing is written to BigQuery.
I found the following three workarounds (each of them works independently of the others):
- Add a `withTimestampAttribute` call on the PubSub subscriptions (see the snippet after this list). The name of the attribute does not matter at all; it can be any existing or non-existing attribute on the incoming messages
- Reduce the number of PubSub subscriptions to just 1
- Remove the Flatten.pCollections operation in between, creating multiple separate pipelines doing the exact same thing
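Continuing the sketch above, the first workaround just adds one call to each read. Here `"ts"` is an arbitrary name and, as noted, it does not need to exist as an attribute on the messages:

```java
// Workaround 1: declare a timestamp attribute on each subscription read.
// The attribute name is arbitrary ("ts" here); messages without it appear
// to fall back to the PubSub publish timestamp.
sources = sources.and(
    p.apply("Read " + sub,
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription(sub)
            .withTimestampAttribute("ts")));
```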
The messages are not intentionally timestamped; writing them to BigQuery using just the PubsubMessage timestamp is completely acceptable.
It also confuses me that even adding a non-existing timestamp attribute seems to fix the issue. I added a debug step to print the element timestamps within the pipeline (roughly the pass-through DoFn below), and they are comparable in both cases; when a non-existing timestamp attribute is specified, it seems to fall back to the PubSub publish timestamp anyway.
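The debug output came from a pass-through step along these lines, inserted after the Flatten (illustrative, not the exact code):

```java
// Pass-through DoFn that prints each element's Beam timestamp, i.e. the
// timestamp the watermark machinery tracks for that element.
static class LogTimestampFn extends DoFn<PubsubMessage, PubsubMessage> {
  @ProcessElement
  public void process(ProcessContext c) {
    System.out.println("element timestamp: " + c.timestamp());
    c.output(c.element());
  }
}
```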
What could be causing this issue, and how can I resolve it? For me the most acceptable workaround is removing the Flatten.pCollections operation, as it doesn't really complicate the code, but I can't get my head around why it fails.