I want to run an Apache Flink (1.11.1) streaming application on Kubernetes, with a filesystem state backend saving to S3. Checkpointing to S3 is working with the following container args:
```yaml
args:
  - "standalone-job"
  - "-s"
  - "s3://BUCKET_NAME/34619f2862ce3e5fc91d80eae13a434a/chk-4/_metadata"
  - "--job-classname"
  - "com.abc.def.MY_JOB"
  - "--kafka-broker"
  - "KAFKA_HOST:9092"
```
The problems I'm facing are:
- I have to select the previous checkpoint directory manually. Is there a way to improve this, e.g. to resume from the latest checkpoint automatically?
- The job increments the chk directory, but it does not actually restore from the checkpoint. Concretely: I emit an event the first time I see it and record it in a `ListState<String>` (roughly like the sketch at the end of this question), yet whenever I deploy a newer version of my application via GitLab, the job emits that event again.
- Why do I have to enable checkpointing explicitly in my code when I have already set `state.backend` to `filesystem`?
```java
env.enableCheckpointing(Duration.ofSeconds(60).toMillis());
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
```
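For reference, here is a minimal sketch of the dedup logic from the second point. The class and state names are made up for illustration; only the `ListState<String>` usage corresponds to what I described above:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical operator: emits an event only the first time it is seen,
// remembering seen events in keyed ListState.
public class FirstSeenFunction extends KeyedProcessFunction<String, String, String> {

    private transient ListState<String> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getListState(
                new ListStateDescriptor<>("seen-events", String.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        boolean alreadySeen = false;
        for (String s : seen.get()) {
            if (s.equals(value)) {
                alreadySeen = true;
                break;
            }
        }
        if (!alreadySeen) {
            seen.add(value);
            // If the checkpointed state were restored on redeploy, this
            // would fire at most once per event across deployments.
            out.collect(value);
        }
    }
}
```

With state correctly restored from the checkpoint, the emitted event should not reappear after a redeploy; in my setup it does.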