Flink, basic rule for checkpointing?

Question

I have 2 questions regarding Flink checkpointing strategy,

I know that checkpoint is related to state (right?), so if I'm not using state (ValueState sort of things) explicitly in my job code, do I need to care about checkpoint? Is it still necessary?
If I need to enable the checkpointing, what should the interval be? Are there any basic rules for setting the interval? Suppose we're talking about a quite busy system (Kafka+Flink), like several billions messages per day.

Many thanks.

David Anderson David Anderson · Accepted Answer · 2019-03-07T14:12:11

Even if you are not using state explicitly in your application, Flink's Kafka source and sink connectors are using state on your behalf in order to provide you with either at-least-once or exactly-once guarantees -- assuming you care about those guarantees. Also, some other operators will also use state somewhat transparently, on your behalf, such as windows and other streaming aggregations.

If your Flink job fails, then it will be rewound back to the most recent successful checkpoint, and resume processing from there. So, for example, if your checkpoint interval is 10 minutes, then after recovery your job might have 10+ minutes of data to catch up on before it can resume processing live data. So choose a checkpoint interval that you can live with from this perspective.

Flink, basic rule for checkpointing?

1 Answers