Apache Flink: Using filter() or split() to split a stream?

Question

I have a DataStream from Kafka in which one field of MyModel can take 2 possible values. MyModel is a POJO with domain-specific fields parsed from a Kafka message.

DataStream<MyModel> stream = env.addSource(myKafkaConsumer);

I want to apply windows and operators to each key, a1 and a2, separately. What is a good way to separate them? I have 2 options in mind, filter and split/select, but I don't know which one is faster.

Filter approach

stream
        .filter(<MyModel.a == a1>)
        .keyBy()
        .window()
        .apply()
        .addSink()

stream
        .filter(<MyModel.a == a2>)
        .keyBy()
        .window()
        .apply()
        .addSink()

Split and select approach

SplitStream<MyModel> split = stream.split(…)

split
        .select(<MyModel.a == a1>)
        …
        .addSink()

split
        .select(<MyModel.a == a2>)
        …
        .addSink()

If split and select are better, how do I implement them so that the stream is split based on the value of a field in MyModel?

I'd recommend NOT using split. Maybe I did something wrong, but in my current project env.execute() threw a weird exception when using split. Replacing split with filter solved my problem. - bolei

Fabian Hueske · Accepted Answer · 2018-12-03T09:35:46

Both methods behave pretty much the same. Internally, the split() operator forks the stream and applies filters as well.

There is a third option: side outputs. Side outputs might have some benefits, such as allowing a different data type for each output. Moreover, with side outputs the filter condition is evaluated only once per record.
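
The "evaluated only once" point can be illustrated with a rough sketch in plain Java. It deliberately does not use the Flink API, so it runs without any dependency: ordinary lists stand in for the streams. In Flink itself, the single-pass routing would typically sit in a ProcessFunction that emits records to an OutputTag, with each branch read back through getSideOutput. Everything named below (SplitModel, the one-field MyModel, CountingPredicate, the sample data) is hypothetical, and the else branch assumes the field only ever holds a1 or a2, as the question states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Plain-Java model (not Flink code) of separating records by a field value. */
public class SplitModel {

    /** Hypothetical stand-in for MyModel: one field "a" holding "a1" or "a2". */
    static final class MyModel {
        final String a;
        MyModel(String a) { this.a = a; }
    }

    /** Predicate on field "a" that counts how many times it is evaluated. */
    static final class CountingPredicate implements Predicate<MyModel> {
        private final String expected;
        int evaluations = 0;
        CountingPredicate(String expected) { this.expected = expected; }
        @Override
        public boolean test(MyModel m) {
            evaluations++;
            return expected.equals(m.a);
        }
    }

    /** Filter approach: each branch checks every record with its own condition. */
    static List<MyModel> filter(List<MyModel> input, Predicate<MyModel> condition) {
        List<MyModel> out = new ArrayList<>();
        for (MyModel m : input) {
            if (condition.test(m)) {
                out.add(m);
            }
        }
        return out;
    }

    /** Side-output style: one pass, one check per record, each record goes to one output. */
    static void route(List<MyModel> input, Predicate<MyModel> isA1,
                      List<MyModel> a1Out, List<MyModel> a2Out) {
        for (MyModel m : input) {
            if (isA1.test(m)) {
                a1Out.add(m);
            } else {
                a2Out.add(m); // assumption: anything that is not a1 is a2
            }
        }
    }

    /** Made-up sample data: three a1 records and one a2 record. */
    static List<MyModel> sample() {
        return Arrays.asList(new MyModel("a1"), new MyModel("a2"),
                             new MyModel("a1"), new MyModel("a1"));
    }

    public static void main(String[] args) {
        List<MyModel> input = sample();

        CountingPredicate f1 = new CountingPredicate("a1");
        CountingPredicate f2 = new CountingPredicate("a2");
        List<MyModel> a1ByFilter = filter(input, f1);
        List<MyModel> a2ByFilter = filter(input, f2);
        System.out.println("filter: " + a1ByFilter.size() + " + " + a2ByFilter.size()
                + " records, " + (f1.evaluations + f2.evaluations) + " evaluations");

        CountingPredicate r = new CountingPredicate("a1");
        List<MyModel> a1 = new ArrayList<>();
        List<MyModel> a2 = new ArrayList<>();
        route(input, r, a1, a2);
        System.out.println("route: " + a1.size() + " + " + a2.size()
                + " records, " + r.evaluations + " evaluations");
    }
}
```

With the four sample records, the two filters perform eight condition checks between them, while the single routing pass performs four and produces the same two groups. Whether that difference matters in a real job depends on how expensive the condition is.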
