I'm working with these two real-time data stream processing frameworks. I've searched everywhere, but I can't find a big difference between them. In particular, I would like to know how they differ based on the size of the data, the topology, etc.
1 Answer
The difference is mainly in the level of abstraction you get when processing streams of data.
Apache Storm is a bit lower level, dealing with data sources (spouts) and processors (bolts) connected together to perform transformations and aggregations on individual messages in a reactive way.
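To give a feel for that lower level, here is a rough sketch of the raw spout/bolt API (not from the question or the Storm docs; SentenceSpout and WordCountBolt are hypothetical placeholders, and the package names assume Storm 1.x):
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A bolt reacts to one tuple (message) at a time.
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        for (String word : tuple.getStringByField("sentence").split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

// Wiring spouts and bolts into a topology (SentenceSpout and WordCountBolt are stand-ins):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new SentenceSpout());
builder.setBolt("split", new SplitSentenceBolt(), 4).shuffleGrouping("sentences");
builder.setBolt("count", new WordCountBolt(), 4).fieldsGrouping("split", new Fields("word"));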
On top of that, there is the Trident API, which abstracts away from this low-level, message-driven view into more aggregated, query-like constructs, which makes things a bit easier to put together. (There is also an SQL-like interface for querying data streams, but it is still marked as experimental.)
From the documentation:
// "spout" emits tuples with a single "sentence" field.
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))   // split each sentence into words
        .groupBy(new Fields("word"))                                     // group the stream by word
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))   // keep running counts in state
        .parallelismHint(6);
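For reference, Split in the snippet above is a small user-defined Trident function (Count is a built-in aggregator); roughly, as sketched in the Trident tutorial, with package names again assuming Storm 1.x:
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

public class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        // Emit one tuple per word in the incoming "sentence" field.
        for (String word : tuple.getString(0).split(" ")) {
            collector.emit(new Values(word));
        }
    }
}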
Apache Flink has a more functional interface for processing events. If you are used to the Java 8 style of stream processing (or to other functional-style languages like Scala or Kotlin), it will look very familiar. It also has a nice web-based monitoring tool. The nice thing about it is that it has built-in constructs for aggregating by time windows and the like (which you can probably also do in Storm with Trident).
From the documentation:
// "text" is a DataStream<String> (e.g. read from a socket); WordWithCount is a simple POJO
// with public "word" and "count" fields, as in Flink's SocketWindowWordCount example.
DataStream<WordWithCount> windowCounts = text
        .flatMap(new FlatMapFunction<String, WordWithCount>() {
            @Override
            public void flatMap(String value, Collector<WordWithCount> out) {
                for (String word : value.split("\\s")) {
                    out.collect(new WordWithCount(word, 1L));
                }
            }
        })
        .keyBy("word")                                    // partition the stream by the "word" field
        .timeWindow(Time.seconds(5), Time.seconds(1))     // sliding window: 5 s long, evaluated every 1 s
        .reduce(new ReduceFunction<WordWithCount>() {
            @Override
            public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                return new WordWithCount(a.word, a.count + b.count);
            }
        });
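To make the Java 8 comparison concrete, roughly the same pipeline can be written with lambdas. This is a sketch against the same, older DataStream API as the snippet above; the .returns(...) hint is needed because Flink cannot infer the output type of a lambda flatMap:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

DataStream<WordWithCount> windowCounts = text
        .flatMap((String value, Collector<WordWithCount> out) -> {
            for (String word : value.split("\\s")) {
                out.collect(new WordWithCount(word, 1L));
            }
        })
        .returns(WordWithCount.class)           // type hint, since lambdas lose generic type info
        .keyBy(wc -> wc.word)                   // key selector instead of a field name
        .timeWindow(Time.seconds(5), Time.seconds(1))
        .reduce((a, b) -> new WordWithCount(a.word, a.count + b.count));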
When I was evaluating the two, I went with Flink, simply because at the time it felt better documented and I got started with it much more easily; Storm was slightly more obscure. There is a course on Udacity that helped me understand Storm much better, but in the end Flink still felt like a better fit for my needs.
You might also want to look at this answer here, although it is a bit old, so both projects have probably evolved since then.