
I have a Dataflow job which reads JSON from 3 PubSub topics, flattens them into a single collection, applies some transformations, and saves the result to BigQuery.

I'm using a GlobalWindow with the following configuration.

.apply(Window.<PubsubMessage>into(new GlobalWindows())
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterFirst.of(
            AfterPane.elementCountAtLeast(20000),
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
    .discardingFiredPanes());
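
For context, the rest of the pipeline is shaped roughly like this (topic names and variable names are simplified placeholders, not the real ones):

    // Read from each of the three topics (topic names are placeholders).
    PCollection<PubsubMessage> topicA = pipeline.apply("ReadTopicA",
        PubsubIO.readMessages().fromTopic("projects/my-project/topics/topic-a"));
    PCollection<PubsubMessage> topicB = pipeline.apply("ReadTopicB",
        PubsubIO.readMessages().fromTopic("projects/my-project/topics/topic-b"));
    PCollection<PubsubMessage> topicC = pipeline.apply("ReadTopicC",
        PubsubIO.readMessages().fromTopic("projects/my-project/topics/topic-c"));

    // Flatten the three streams into one PCollection, to which the windowing above is applied.
    PCollection<PubsubMessage> merged = PCollectionList.of(topicA).and(topicB).and(topicC)
        .apply("FlattenTopics", Flatten.pCollections());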

The job is running with the following configuration:

Max Workers: 20
Disk Size: 10 GB
Machine Type: n1-standard-4
Autoscaling Algorithm: Throughput Based
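
In Java these correspond to the standard Dataflow pipeline options; setting them programmatically looks roughly like this (a sketch against DataflowPipelineOptions, not my exact code):

    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setMaxNumWorkers(20);
    options.setDiskSizeGb(10);
    options.setWorkerMachineType("n1-standard-4");
    // Throughput-based autoscaling, as shown in the configuration above.
    options.setAutoscalingAlgorithm(
        DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);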


The problem I'm facing is that after processing a few messages (approx. 80k) the job stops reading messages from PubSub. There is a backlog of close to 10 million messages in one of those topics, and yet the Dataflow job is neither reading the messages nor autoscaling.

I also checked the CPU usage of each worker, and that is hovering in the single digits after the initial burst.


I've tried changing the machine type and the max worker configuration, but nothing seems to work.

How should I approach this problem?

Would it be possible for you to restart the pipeline and check if the issue is recurrent? - Alexandre Moraes

Yes, I tried restarting the job 5 times, but every time it gets stuck at the 50-60K mark. - kaysush

1 Answer


I suspect the windowing function is the culprit. GlobalWindow isn't suited to streaming jobs (which I assume this is, given the use of PubSub), because the single global window never ends: the watermark can never pass the end of the window, so the AfterWatermark.pastEndOfWindow() part of the trigger never fires.

In your situation, it looks like the window fires early once, when it first hits either the element count or the processing-time delay, but after that it gets stuck waiting for an on-time firing that never arrives. A quick fix to check whether this is the case is to wrap the trigger in Repeatedly.forever so it keeps firing. Note that withEarlyFirings only accepts a trigger that fires once, so the wrapper has to go at the triggering(...) level, replacing the AfterWatermark trigger (which never fires in a global window anyway):

.triggering(Repeatedly.forever(
    AfterFirst.of(
        AfterPane.elementCountAtLeast(20000),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
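
Applied to the windowing code from your question, the full transform would then look roughly like this (a sketch; Beam also requires allowed lateness to be set explicitly whenever a custom trigger is specified, so a withAllowedLateness call is included):

    .apply(Window.<PubsubMessage>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterFirst.of(
                AfterPane.elementCountAtLeast(20000),
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
        // Required whenever a custom trigger is set; nothing is ever late in a global window.
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());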

This should allow the trigger to fire repeatedly, preventing the window from getting stuck.

However, for a more permanent solution I recommend moving away from GlobalWindow in streaming pipelines. Using fixed-time windows with early firings based on element count would give you the same behavior, but without the risk of getting stuck; a sketch follows below.
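
A minimal sketch of that approach, assuming a hypothetical one-minute window size (tune it to your latency needs):

    .apply(Window.<PubsubMessage>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterFirst.of(
                AfterPane.elementCountAtLeast(20000),
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
        // Allowed lateness must still be set when a custom trigger is specified.
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());

Here the watermark eventually passes the end of each one-minute window, so every window closes on its own, and the early firings repeat automatically until that happens.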