For example, I have a large stream of words and want to count each word. The problem is that the words are skewed: a few words occur very frequently, while most other words are rare. In Storm, we could handle this as follows: first do a shuffle grouping on the stream, then have each node count words locally within a time window, and finally merge the local counts into the cumulative result. From another question of mine, I know that Flink only supports windows on a keyed stream; otherwise the window operation will not be parallel.
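To make the Storm-style idea concrete, here is a minimal sketch in plain Java (no Storm or Flink dependencies; the shuffle grouping and window are simulated by simply splitting the input into per-worker shards, and all class and method names are illustrative) of the two-phase counting described above: each worker counts its shard locally, then the partial counts are merged.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseWordCount {
    // Phase 1: each "worker" counts the words in its own shard locally,
    // so a hot key only produces one partial count per worker.
    static Map<String, Long> localCount(List<String> shard) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : shard) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // Phase 2: merge the per-worker partial counts into the cumulative result.
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Skewed input: "the" dominates; a shuffle-like split spreads it
        // evenly over two workers instead of sending it all to one key owner.
        List<String> shard1 = List.of("the", "the", "the", "cat");
        List<String> shard2 = List.of("the", "the", "dog", "the");

        Map<String, Long> total =
            merge(List.of(localCount(shard1), localCount(shard2)));
        System.out.println(total.get("the")); // 6
        System.out.println(total.get("cat")); // 1
    }
}
```

The point of the pre-aggregation phase is that the downstream merge only sees one partial count per word per worker, rather than one record per occurrence of a hot word.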
My question is: is there a good way to handle this kind of data skew in Flink?