0
votes

I am unable to understand the use of states in Apache Flink. As far as my understanding, states maintain variables values during the execution of a Flink program. I think the same thing can be achieved through a class variable.

For instance, if I declare a class variable "someCounter" and increment its value in some Map function, then the "someCounter" value is retained during the course of the code execution, then why do we need an expensive state to maintain the similar values as mentioned in the example on this link: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#tab_java_0 ?

static class abc extends RichMapFunction<X,Y> {
    long someCounter = 0;
    //ctor
    public  abc() {};

    @Override
    public Y map(X x) throws Exception {            
        someCounter++;
        if(someCounter > 1000)
            someCounter = 0;
        return someCounter;
    }
}
1

1 Answers

4
votes

Failure recovery, redeployments, and rescaling are some of the big differences.

Flink takes periodic checkpoints of the state it is managing. In the event of a failure, your job can automatically recover using the latest checkpoint, and resume processing. You can also manually trigger a state snapshot (called a savepoint in this case) and use it to restart after a redeployment. While you are at it, you can also rescale the cluster up or down.

You can also choose where your Flink state lives -- either as objects on the heap, or as serialized bytes on disk. Thus it is possible to have much more state than can fit in memory.

From an operational perspective, this is more like having your data in a database, than in memory. But from a performance perspective, it's more like using variables: the state is always local, available with high throughput and low latency.