0
votes

My streaming flink job has checkpointing time of 2-3s(15-20% of time) and 3-4 mins(8-12% of time) and 2 mins on an average. We have two operators which are stateful. First is kafka consumer as source(FlinkKafkaConsumer010) and another is hdfs sink(CustomBucketingSink). This two makes state of around 1-1.5Gb for savepoints and 800mb-6Gb(3gb average) for checkpoint. We have 30sec of tumbling processing window. Checkpointing duration and minimum pause between two checkpoiting is 3 mins. My job consumes around 3 millions of records per minute on an average and around 20 millions/min records on peak time. There is more than enough cpu and memory for flink.

Now here are my doubts :

1) Even when few checkpointing state sizes are less(70-80% less) as compare to other checkpointing state, it takes minutes(15-20% of time) as compare to other one which takes 5-10 secs.

2) Buffer alignment size sometimes increases to 7-8gb as compare to 800mb-1gb average but checkpointing time is not affected by this. I guess it should take more time as it should wait for checkpoint barrier.

3) Will checkponting time be affected if we increase tumbling window size. I am considering it shouldn't affect neither savepoint time and nor checkpoint time.

4) Few of the sub-tasks which sinks into hdfs takes 2-3 mins (5-10% time). So while 98% of subtasks are completed in 30-50 secs. 1-2(95% of time, it's only one) subtasks takes 2-3 mins. Which delays the whole checkpointing time. Problem is not with the node on which this sub-tasks are running because it happens sometimes to some node and sometimes to another node.

5) We are getting one exception once every 6-8 hour which restarts the job. TimerException{java.nio.channels.ClosedByInterruptException} at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:288)

6) How to minimize the alignment buffer time.

7) Savepoint time increases or decreases with increase and decrease of rate of input or state size but checkpointing time doesn't hold the same. Checkpointing time sometimes shows inverse relation with state size or we can saw it's not affected with the state size.

8) Whenever we restart the job, all sub-tasks take uniform time for 2-3 days on all nodes but afterwards 1-2 sub-tasks takes 2-3 minutes as compare to other which are taking 15-30 secs. I might be wrong on this behaviour but as far i have observed, this is also a case.

1

1 Answers

0
votes

Note that windows are stateful, and unless you are doing incremental aggregation, longer windows have more state, which will in turn affect checkpoint sizes and durations.

It would be helpful to know which state backend you are using, and whether or not you are using incremental checkpointing.

I would start by trying to find the cause of the slow sink subtask(s) causing the backpressure, which is in turn causing the painful checkpointing. Could be data skew, or resource starvation, for example. Some common causes include insufficient CPU, network, or disk bandwidth, or AWS (or other API) rate limits. It may seem that you have plenty of CPU, for example, but one hot key can put way too much load on one thread, and thereby hold back the entire cluster.

If you find a way to correct the imbalance at the sink, then the checkpoint alignment problems should calm down. (Note that if you can tolerate duplicate results, you could disable checkpoint barrier alignment by choosing CheckpointingMode.AT_LEAST_ONCE.)