4
votes

I have a flink job, which reads user events, uses session windows and writes back to kafka.

The state backend that I'm using is s3 (no hdfs cluster, just using the libs).

The problem is that the end to end checkpointing time keeps rising until checkpoints are dropped, and most of the time is spent on "Alignment".

The question is - why?, how can I solve this without setting the checkpointing mode to AT_LEAST_ONCE?

As you can see, the checkpoints duration keep going up

1
@rmetzger insights?ItayB
Have you looked to see if there is significant backpressure? I can see how that might cause this. ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/…David Anderson
@alpinegizmo yes, back pressure status is OKOmriManor
This is happening to me too (I'm using HDFS). Did you find the reason?insanely_sin
We're seeing the same issue, only instead of S3 we're using GCS. Did you find anything yet?Sander

1 Answers

0
votes

After looking further into the issue, this was due to high GC times (which occur frequently during checkpoints). We were using the FS state backend, while there is FS in it's name, that only refers to the output location of the checkpoint, while the entire state is still stored in memory (as opposed to rocksdb state backend).

We are still using FS state backend though, due to rocks-db high(er) latency, which we cannot permit in this application.