How can I identify the root cause of back pressure in a task? (i.e., which operator of a multi-operator task is causing the back pressure)
- Are there any relevant logs? (I tried tracking StackTraceSampleCoordinator without success: "Received late stack trace sample" does not appear in any of the logs)
- Any other tools I can use?
=====================================
Here's what I've encountered: during a Flink job execution, a back pressure indication is displayed. As I understand it, the task causing it is the one immediately downstream of the last task that shows a back pressure indication. That task runs a chain of multiple operators: a reduce, a map and a sink. Analyzing the job's metrics does not help: whatever leaves the preceding operator is what enters this one. The back pressure indication appears for the 1st and 2nd tasks of the following job plan:
[Source: Custom Source -> Filter -> (Flat Map -> Timestamps/Watermarks)] ->
[Timestamps/Watermarks] ->
[TriggerWindow(TumblingEventTimeWindows(300000), ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.TupleSerializer@f812e02f, reduceFunction=EntityReducer@2d19244c}, EventTimeTrigger(), WindowedStream.reduce(WindowedStream.java:300)) -> Map -> Sink: Unnamed]
- where [] symbolizes a task.
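
To make the reasoning I'm applying explicit, here is a minimal sketch of the heuristic (not Flink code). It assumes a linear plan listed from upstream to downstream, with a boolean per task for whether the UI shows back pressure. The function name `suspect_task` and the flags are hypothetical, and the third task name is abbreviated from the plan above.

```python
from typing import List, Optional, Tuple


def suspect_task(tasks: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the task right after the last back-pressured one.

    `tasks` is a linear plan in upstream-to-downstream order, each entry
    being (task name, shows back pressure indication). Hypothetical helper
    used only to illustrate the heuristic described in the question.
    """
    last_bp = None
    for i, (_, back_pressured) in enumerate(tasks):
        if back_pressured:
            last_bp = i
    if last_bp is None:
        return None  # nothing is back pressured
    if last_bp + 1 >= len(tasks):
        return None  # last task is back pressured; the cause lies outside the plan
    return tasks[last_bp + 1][0]


# Flags taken from what I observe: tasks 1 and 2 show back pressure.
plan = [
    ("Source: Custom Source -> Filter -> (Flat Map -> Timestamps/Watermarks)", True),
    ("Timestamps/Watermarks", True),
    ("TriggerWindow -> Map -> Sink: Unnamed", False),
]
print(suspect_task(plan))
```

This only narrows things down to the third task. It says nothing about which of the chained operators inside it (the windowed reduce, the map or the sink) is the slow one, which is exactly what I'm trying to find out.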