Flink: Broadcasted Operator chaining

Question

Assume that I have a DataStream of events and I want to broadcast it to a (rich) map operator (map1) that is chained to another (rich) map operator (map2). The parallelism of the two maps is the same. What I want is for the output of each parallel instance of map1 to go to exactly one parallel instance of map2 (i.e., no broadcasting between the two maps). Here's what I've done so far, but I'm not sure if it is logically correct. Is it OK?

val trainedStream = events.broadcast.map(new Mapper1(...)).setParallelism(par)
trainedStream.startNewChain.map(new Mapper2(...)).setParallelism(par)

Followup Question: Is the SubtaskIndex (received from RuntimeContext.getIndexOfThisSubtask) of two chained subtasks/parallel instances of map1 and map2 the same? Is there a way to check this?

The code is in Scala, but I guess the same applies to Java.

Arvid Heise · Accepted Answer · 2020-03-11T08:20:37

Chaining happens automatically in Flink whenever possible. So, in your example, it's enough to just use

val trainedStream = events.broadcast.map(new Mapper1(...)).map(new Mapper2(...))

I'd then set the parallelism once on the execution environment (env.setParallelism(par)) instead of on each operator.
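To make the suggestion above concrete, here is a minimal sketch of a complete job. It assumes the Flink 1.x Scala DataStream API; the bodies of Mapper1 and Mapper2, the value of par, and the sample input are hypothetical placeholders, not the asker's actual code.

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.streaming.api.scala._

// Hypothetical stand-in for the asker's Mapper1: appends a marker to each event.
class Mapper1 extends RichMapFunction[String, String] {
  override def map(value: String): String = value + "-m1"
}

// Hypothetical stand-in for the asker's Mapper2.
class Mapper2 extends RichMapFunction[String, String] {
  override def map(value: String): String = value + "-m2"
}

object BroadcastChainingSketch {
  def main(args: Array[String]): Unit = {
    val par = 4 // assumed parallelism, chosen only for illustration

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism once on the environment instead of per operator.
    env.setParallelism(par)

    // Sample input; a real job would read from an actual source.
    val events: DataStream[String] = env.fromElements("a", "b", "c")

    // broadcast sends every event to every parallel instance of Mapper1.
    // Mapper1 and Mapper2 have the same parallelism and a plain forward
    // connection, so Flink can chain them without any extra calls.
    val trainedStream = events.broadcast
      .map(new Mapper1)
      .map(new Mapper2)

    trainedStream.print()
    env.execute("broadcast-chaining-sketch")
  }
}
```

With this setup each input element should be printed par times, once per parallel instance, because of the broadcast.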

By the way, are you sure you want to broadcast the events? A DataStream is processed in parallel by default. It is very unusual to broadcast events, because every parallel instance then receives every event, so each event is processed as many times as the parallelism.

Followup Question: Is the SubtaskIndex (received from RuntimeContext.getIndexOfThisSubtask) of two chained subtasks/parallel instances of map1 and map2 the same? Is there a way to check this?

The subtask index is the same for chained operators because they reside in the same task (hence they cannot even have different indices). You can see that chaining was successful if the job graph (for example in the Flink web UI) shows a single task named mapper1 -> mapper2.
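One way to observe the indices directly is to have each mapper tag its output with its own subtask index. The following is only a sketch under the same assumptions as before (Flink 1.x Scala DataStream API); IndexTaggingMapper, SubtaskTag, and the sample input are hypothetical names introduced for illustration.

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.streaming.api.scala._

// Hypothetical helper that formats the tag appended to each record.
object SubtaskTag {
  def format(value: String, op: String, idx: Int): String = s"$value|$op=$idx"
}

// Hypothetical mapper that appends the index of the subtask it runs in.
class IndexTaggingMapper(op: String) extends RichMapFunction[String, String] {
  override def map(value: String): String = {
    // The runtime context is only available while the job is running.
    val idx = getRuntimeContext.getIndexOfThisSubtask
    SubtaskTag.format(value, op, idx)
  }
}

object SubtaskIndexCheck {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // assumed parallelism, for illustration only

    env.fromElements("a", "b", "c")
      .broadcast
      .map(new IndexTaggingMapper("map1"))
      .map(new IndexTaggingMapper("map2"))
      .print()

    env.execute("subtask-index-check")
  }
}
```

Every printed record should carry the same index for map1 and map2. Note that matching indices alone do not prove chaining: a forward connection between two unchained operators of equal parallelism also sends the output of instance i to instance i. The task name mapper1 -> mapper2 remains the indicator for chaining itself.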
