
How can I identify the root cause of back pressure in a task? (i.e. which operator of a multi-operator task is causing the back pressure)

  • Are there any relevant logs? (I tried tracking the StackTraceSampleCoordinator, but its "Received late stack trace sample" message does not appear in any of the logs.)
  • Any other tools I can use?

=====================================

Here's what I've encountered: during a Flink job execution, a back pressure indication is displayed. As I understand it, the task causing the back pressure is the one succeeding the "latest" task that shows a BP indication. This task runs a chain of multiple operators: a reduce, a map and a sink. Analyzing the job's metrics does not help: whatever comes out of the preceding operator is exactly what goes into this operator. The back pressure indication appears for the 1st and 2nd tasks of the following job plan:

[Source: Custom Source -> Filter -> (Flat Map -> Timestamps/Watermarks)] -> [Timestamps/Watermarks] -> [TriggerWindow(TumblingEventTimeWindows(300000), ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.TupleSerializer@f812e02f, reduceFunction=EntityReducer@2d19244c}, EventTimeTrigger(), WindowedStream.reduce(WindowedStream.java:300)) -> Map -> Sink: Unnamed]

  • where [] symbolizes a task.

1 Answer


In the Flink UI, backpressure for a task indicates that the task's call to collect() is blocking. So if tasks 1 & 2 in your example have backpressure, then it's likely something in task 3 that is not keeping up with your source.
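
If it helps to isolate the operator, note that Flink only reports backpressure per task, so one option is to temporarily break the operator chain: each operator then becomes its own task with its own backpressure reading. A rough sketch of what that could look like for the job above (EntityReducer is taken from the job plan; "events", MyMapFunction and MySink are placeholders for your actual stream, map and sink):

    // Hypothetical sketch: split the reduce -> map -> sink chain into separate tasks
    // so the UI shows a backpressure indicator for each operator individually.
    // Window assigner/time classes come from org.apache.flink.streaming.api.windowing.
    events
        .keyBy(e -> e.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(5)))
        .reduce(new EntityReducer())
        .map(new MyMapFunction()).name("map").disableChaining()
        .addSink(new MySink()).name("sink").disableChaining();

Un-chaining adds serialization overhead between the now-separate tasks, so treat it as a temporary diagnostic setup rather than a permanent change.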

Note that if your source is synthesizing events without delay, but you have a real sink, then you'll always see backpressure as the sink becomes the bottleneck. Details on your actual source & sink would be useful here.
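
If the source really is generating events as fast as it can, a quick sanity check is to throttle it and see whether the backpressure indication goes away. A minimal sketch of such a source, assuming the legacy SourceFunction API and an arbitrary ~1000 events/s rate:

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    // Sketch: a synthetic source throttled to roughly 1000 events per second.
    // If the backpressure on tasks 1 and 2 disappears with the throttle in place,
    // the window/sink task is merely slower than an unthrottled source rather than
    // genuinely misbehaving.
    public class ThrottledSource implements SourceFunction<Long> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            long value = 0;
            while (running) {
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(value++);
                }
                Thread.sleep(1); // remove the sleep to reproduce the original behaviour
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }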

To dig deeper into what's happening inside task 3, you can hook up something like YourKit to monitor actual CPU usage for the various (pipelined) operations in that task. Or just run kill -QUIT <taskmanager pid> a few times to see which threads are blocked and which are doing real work.
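
The kill -QUIT dump goes to the JVM's stdout (the TaskManager's .out file in a standard setup). If you'd rather poll the same information programmatically from inside the TaskManager JVM, here's a sketch using the standard ThreadMXBean; the class name and the five-frame cutoff are arbitrary:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch: print every live thread's state plus its top stack frames. Threads that
    // are repeatedly RUNNABLE inside the window/map/sink code are doing real work;
    // threads that stay WAITING/BLOCKED point at whatever they are waiting on.
    public class ThreadDumper {
        public static void dump() {
            ThreadMXBean bean = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
                System.out.printf("%s [%s]%n", info.getThreadName(), info.getThreadState());
                StackTraceElement[] stack = info.getStackTrace();
                for (int i = 0; i < Math.min(5, stack.length); i++) {
                    System.out.println("    at " + stack[i]);
                }
            }
        }
    }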