
I have a Flink project that receives an event stream and runs some logic to attach a flag to each event. It then keeps the flag and the event ID for a while so they can be reused or queried by other systems.

In this case the data volume is not large, but it needs good reliability and, of course, should be updated in time before it is used.

Traditionally, we would use an external database to store this kind of data. But after learning about Flink state, it seems very useful to me: it has a good state backend mechanism and can be made queryable.

So I am asking whether Flink state is a reasonable replacement for the external database here, and I would like to hear your arguments and evidence.

Not exactly the answer you are looking for, but I would recommend watching the keynote that Netflix gave at the Flink Forward conference in San Francisco in 2017. I remember that they use extremely large state, and it will also give you a sense of what they use state for and what they use external databases for. It has been a while since I watched the video, so let me know if it doesn't cover what you are looking for, but I am 95% sure it will. Netflix: youtube.com/… - Jicaar
Hey @Jicaar, thanks for your reply. I had a brief look at the video but haven't found it yet. I will study it more later. Thank you anyway. - Leyla Lee

1 Answer


I am moving my last two comments here as an answer, since I realized that answering is essentially what I am doing.

OK, it might have been the Uber keynote then. But the bottom line is that there are companies using extremely large state to hold the data they need to run calculations against efficiently.

For example, I made a program that took in messages with a unique ID and a value field (int). I then had a stateful function keyed by the message ID, and every message I received for that ID was added to a stateful value object, updating the total for that ID. You could use a stateful list object to hold all the messages you received if you needed that. An alternative is to store the data in a "new age" database designed for quick reads and writes, like Cassandra. But that approach comes with its own limitations because of the I/O (long story short, Flink and Cassandra could handle lots of data fast, the network bandwidth could not).
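To make the running-total idea concrete, here is a minimal sketch in plain Java. It is not Flink code and not my original program: the class `KeyedRunningTotal` and its methods are hypothetical names, and a `HashMap` stands in for Flink's keyed value state, which Flink would scope to the current key and checkpoint for you.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of a keyed running total. Hypothetical names, not Flink API. */
final class KeyedRunningTotal {
    // Stands in for keyed state: one running total per message ID.
    private final Map<String, Integer> totalById = new HashMap<>();

    /** Adds the value to the total for its ID and returns the updated total. */
    int processElement(String id, int value) {
        return totalById.merge(id, value, Integer::sum);
    }

    /** Returns the current total for the ID, or 0 if no message has been seen. */
    int totalFor(String id) {
        return totalById.getOrDefault(id, 0);
    }

    public static void main(String[] args) {
        KeyedRunningTotal totals = new KeyedRunningTotal();
        totals.processElement("a", 5);
        totals.processElement("b", 2);
        totals.processElement("a", 3);
        // Prints: a=8 b=2
        System.out.println("a=" + totals.totalFor("a") + " b=" + totals.totalFor("b"));
    }
}
```

In an actual Flink job the map would go away: you would key the stream by the ID and keep a single value per key in keyed state, so the lookup by ID happens implicitly.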

So keeping all that data in Flink state can be done, works well, and has many benefits.

The one caveat is that I do not know whether Flink's state has the same kind of failsafes as Cassandra or Kafka. They replicate their data across nodes, so if one node goes down, the others can handle everything and repopulate that node when it restarts. Flink's state can be stored on a remote backend such as an S3 bucket or HDFS (see: https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/state_backends.html), but I do not know whether the state itself is replicated. In other words, if the state is all stored on one node and that node goes down, I do not know whether it is gone for good or backed up on another node. That is worth looking into, since it should weigh heavily in your decision.

Hope that at least gives you some info and a rough idea of what questions to ask.