Is it possible to make a map-only task execute in parallel in Apache Flink?

Question

I'm using Flink to process some JSON-format streaming data:

{"uuid":"903493290432934", "bin": "68.3"}
{"uuid":"324938722984237", "bin": "56.8"}
...

My job is quite simple:

get stream from the Data Source ---> deserialize data into String ---> transform String to JSON object myJsonObj ---> double res = myJsonObj.get("bin") ---> do some heavy calculation with res.

Here is my code:

FlinkPravegaReader<String> source = ... // init source
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// transform String to MyJson
DataStream<MyJson> jsonStream = env.addSource(source).name("Pravega Stream")
    .map(new MapFunction<String, MyJson>() {
        @Override
        public MyJson map(String s) throws Exception {
            MyJson myJson = JSON.parseObject(s, MyJson.class);
            return myJson;
        }
    });
// do the heavy process
DataStream<String> heavyResult = jsonStream
    .map(new MapFunction<MyJson, String>() {
        @Override
        public String map(MyJson myJson) throws Exception {
            double res = myJson.get("bin");
            // do some very heavy calculation
            return myJson.get("uuid").asText() + " done.";
        }
    });
heavyResult.print();

As I understand it, since I haven't used keyBy or window, I assume I'm using windowAll by default. Am I right?

If so, the Flink documentation says that windowAll cannot run in parallel. Does that mean the heavy calculation has to process elements one at a time? I'm wondering whether the heavy calculation can be done in parallel.

As you can see, using keyBy/window doesn't seem to make sense in my case. So how can I make this job execute in parallel? Is it possible to have two branches running together on the same Data Source, as below?

             /----windowAll ---- do the heavy calculation
            /
Data Source-
            \
             \----windowAll ---- do the heavy calculation

Is this design possible? Say the Data Source generates two elements, A and B. With this design, I expect one windowAll to process A while the other windowAll processes B.

Ezequiel Ezequiel · Accepted Answer · 2020-06-09T12:51:44

A keyed stream partitions your data so that all records with the same key are sent to the same parallel instance of the downstream operator.

A window is used when you want to group elements from the stream and compute over them as a set.

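If your case fits neither of these, you don't need to use them. A plain map without keyBy or window is not turned into a windowAll, and it can run in parallel.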
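To make the routing difference concrete, here is a minimal plain-Java sketch (no Flink dependency) of how records could be assigned to parallel instances of an operator. It is a simplified model and not Flink's implementation: Flink hashes keys into key groups rather than using `hashCode()` modulo the parallelism, and the names `RoutingSketch`, `keyedTarget` and `roundRobinTarget` are made up for illustration.

```java
public class RoutingSketch {

    // Keyed routing: the target instance depends only on the key,
    // so equal keys always land on the same parallel instance.
    // (Simplified; Flink itself uses key groups, not hashCode() % parallelism.)
    public static int keyedTarget(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Non-keyed routing: records are spread over the instances
    // without looking at their content, here in round-robin order.
    public static int roundRobinTarget(long recordIndex, int parallelism) {
        return (int) (recordIndex % parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 3;
        String[] uuids = {"903493290432934", "324938722984237", "903493290432934"};
        for (int i = 0; i < uuids.length; i++) {
            System.out.printf("uuid=%s keyed->%d roundRobin->%d%n",
                    uuids[i],
                    keyedTarget(uuids[i], parallelism),
                    roundRobinTarget(i, parallelism));
        }
    }
}
```

In this model the first and third records share a uuid, so they get the same keyed target, while the round-robin target depends only on arrival order. A map-only job needs no keyed guarantee, which is why it can be parallelized without keyBy.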

To set the default parallelism for the whole job, use:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3);  // Note: you'll need at least 3 task slots available.

To set the parallelism for a single operator (the heavy calculation), use:

DataStream<String> heavyResult = jsonStream
    .map(new MapFunction<MyJson, String>() {
        @Override
        public String map(MyJson myJson) throws Exception {
            double res = myJson.get("bin");
            // do some very heavy calculation
            return myJson.get("uuid").asText() + " done.";
        }
    }).setParallelism(3);  // Note: you'll need at least 3 task slots available.
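As a rough plain-Java analogy of what a parallelism of 3 means for a map operator, the sketch below runs the same per-record function on three workers, each handling its own share of the records. It is an illustration under stated assumptions and not Flink code: `ParallelMapSketch`, `parallelMap` and `heavy` are hypothetical names, a thread pool stands in for task slots, and the round-robin split stands in for however Flink distributes records to the parallel instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMapSketch {

    // Stand-in for the heavy calculation (hypothetical placeholder).
    public static String heavy(String uuid) {
        return uuid + " done.";
    }

    // Applies heavy() to every record using `parallelism` workers.
    // Record i is handled by worker i % parallelism.
    public static List<String> parallelMap(List<String> records, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int w = 0; w < parallelism; w++) {
                final int worker = w;
                futures.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>();
                    for (int i = worker; i < records.size(); i += parallelism) {
                        out.add(heavy(records.get(i)));
                    }
                    return out;
                }));
            }
            // Collect worker by worker, so the result order can differ from the input order.
            List<String> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.addAll(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelMap(Arrays.asList("A", "B", "C", "D"), 3));
    }
}
```

Because each record is processed independently, no keyBy or window is needed to split the work. As in a parallel Flink job, the order of results across workers is not the input order.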

More info at https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/parallel.html
