I am just getting started with Flink and am building an application with the requirements below.

  • Ingest data into Kafka with, say, 50 partitions (incoming rate: 100,000 msgs/sec)
  • Read data from Kafka and process each record in real time (do some computation, compare with old data, etc.)
  • Store the output in Cassandra

I was looking for a real-time streaming platform and found Flink to be a great fit for both real-time and batch processing.

  • Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or another streaming platform?
  • Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to perform a sequence of steps for real-time streaming?
  • If each computation takes about 20 milliseconds, how should I design the job in Flink to get better throughput?
  • Can I use Redis or Cassandra to fetch some data within Flink for each computation?
  • Will I be able to use a JVM in-memory cache inside Flink?
  • Can I also aggregate data by key over a time window (for example, 5 seconds)? For example, if 100 messages come in and 10 of them have the same key, can I group all messages with the same key together and process them?
  • Are there any tutorials on best practices for using Flink?

Thanks, I appreciate all your help.

1 Answer

Given your task description, Apache Flink looks like a good fit for your use case.

In general, Flink provides low latency and high throughput, and it has a parameter (the buffer timeout) to trade one off against the other. You can read data from and write data to Redis or Cassandra. However, you can also store state internally in Flink. Flink also has sophisticated support for windows, so grouping messages by key over a time window such as 5 seconds is possible. For more information, you can read the blog on the Flink website, check out the documentation, or follow the Flink training to learn the API.
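
To make the windowing part of the question concrete, here is a minimal plain-Java sketch of what grouping by key over a 5-second tumbling window means. It is not the Flink API and not how Flink implements windows: the `TumblingWindowSketch` class, the `Message` type, and the `group` helper are hypothetical names made up for this illustration, and it assumes every message carries a timestamp in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of keyed tumbling windows; this is NOT the Flink API.
final class TumblingWindowSketch {

    // Hypothetical message type: a key, a timestamp in milliseconds, and a payload.
    static final class Message {
        final String key;
        final long timestampMs;
        final String payload;

        Message(String key, long timestampMs, String payload) {
            this.key = key;
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    // Start of the tumbling window that contains the given timestamp.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return Math.floorDiv(timestampMs, windowSizeMs) * windowSizeMs;
    }

    // Groups payloads first by window start, then by key.
    static Map<Long, Map<String, List<String>>> group(List<Message> messages, long windowSizeMs) {
        Map<Long, Map<String, List<String>>> windows = new TreeMap<>();
        for (Message m : messages) {
            windows
                .computeIfAbsent(windowStart(m.timestampMs, windowSizeMs), w -> new TreeMap<>())
                .computeIfAbsent(m.key, k -> new ArrayList<>())
                .add(m.payload);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Message> input = new ArrayList<>();
        input.add(new Message("a", 1_000L, "a1"));
        input.add(new Message("b", 2_000L, "b1"));
        input.add(new Message("a", 4_999L, "a2"));
        input.add(new Message("a", 5_000L, "a3")); // falls into the next 5-second window
        System.out.println(group(input, 5_000L));
    }
}
```

With a window size of 5,000 ms, the two "a" messages at 1,000 ms and 4,999 ms end up in the same group, while the one at 5,000 ms starts a new window. In a real Flink job the same idea would be expressed by keying the stream and applying a time window, with Flink handling the state, the timers, and the distribution across parallel tasks.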