What may cause long barrier alignment durations in Apache Flink jobs?

Question

I run my Flink job on YARN, and a small number of subtasks show a long alignment duration.

What might cause this?

Hellen Hellen · Accepted Answer · 2018-08-14T02:33:41

For exactly-once semantics, Flink aligns the input streams at operators that receive multiple input streams. A long alignment duration therefore means that the affected subtask received the checkpoint barrier on some input(s) much later than on the others.
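
As a toy model of the alignment described above (not Flink's actual implementation): alignment starts when the first barrier of a checkpoint arrives on any input channel and ends when the last one arrives, so the alignment duration is the gap between those two arrivals. The function name and the timestamps below are made up for illustration.

```python
def alignment_duration_ms(barrier_arrival_ms):
    """Toy model: alignment lasts from the first barrier arrival to the last.

    barrier_arrival_ms maps an input channel name to the time (ms) at which
    the checkpoint barrier arrived on that channel. Hypothetical helper, not
    part of any Flink API.
    """
    if not barrier_arrival_ms:
        raise ValueError("need at least one input channel")
    arrivals = barrier_arrival_ms.values()
    return max(arrivals) - min(arrivals)


# Two input channels: the barrier on channel "b" is 4.2 s behind channel "a".
print(alignment_duration_ms({"a": 1_000, "b": 5_200}))  # prints 4200
# With a single input channel there is nothing to wait for.
print(alignment_duration_ms({"a": 1_000}))  # prints 0
```

Anything that delays the barrier on one channel (or delays the subtask in processing it) shows up directly in this gap.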

The Flink documentation describes barrier alignment and also covers ways to monitor checkpointing.

To be more specific, the reasons may be:

Data skew: most of the data has been sent to the node(s) with the long alignment duration.
Garbage collection: long GC pauses can significantly delay checkpoint alignment.
Slow state access: putting to or getting from state takes a long time. For RocksDB, check whether there are index misses or cache misses.
Network buffer problems, e.g. buffers filling up under back pressure so that barriers are delayed behind queued data.
User code bugs, for example an endless loop.
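
To narrow down which of these causes applies, it can help to first identify the slow subtasks and then compare their input volume with the rest. A minimal sketch, assuming you have already collected per-subtask alignment durations and received-record counts (for example by copying them from the web UI); the function names, the threshold factor of 5, and the sample numbers are all hypothetical.

```python
from statistics import median


def find_stragglers(alignment_ms, factor=5.0):
    """Return subtask indices whose alignment duration exceeds factor * median."""
    base = median(alignment_ms.values())
    # Guard against a zero median so that tiny durations are not all flagged.
    threshold = max(base * factor, 1.0)
    return sorted(i for i, d in alignment_ms.items() if d > threshold)


def skew_ratio(records_in):
    """Ratio of the busiest subtask's record count to the mean count."""
    counts = list(records_in.values())
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean > 0 else 0.0


# Hypothetical sample: subtask 3 aligns slowly and also receives most records.
alignment_ms = {0: 12, 1: 9, 2: 15, 3: 8400}
records_in = {0: 1_000, 1: 1_200, 2: 900, 3: 36_900}

print(find_stragglers(alignment_ms))     # [3]
print(round(skew_ratio(records_in), 2))  # 3.69
```

If the flagged subtasks also dominate the record counts (a ratio well above 1), data skew is a likely suspect; if the load is even, GC, state access, network buffers, or user code are more plausible places to look.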
