I have a few questions regarding the parallelism of flink. This is my setup:
I have 1 master node and 2 slaves. In flink I have created 3 kafka consumers which each consume from a different topic.
Since the order of the elements is important to me, each topic only has one partition and I have flink setup to use the event time.
Then I run the following pipeline (in pseudo code) on each of the data streams:
source
.map(deserialize)
.window
.apply
.map(serialize)
.writeTo(sink)
Up until now I started my flink program with the argument -p 2
assuming that this would allow me to use both of my nodes. The result is not what I was hoping for, since the order of my output is messed up sometimes.
After reading through the flink documentation and trying to understand it better, could someone please confirm my following “learnings"?
1.) Passing -p 2
configures the task parallelism only, i.e. the maximum number of parallel instances a task (such as map(deserialize)
) will be split into. If I want to keep the order through the whole pipeline I have to use -p 1
.
2.) This to me seems contradictory/confusing: even if the parallelism is set to 1, different tasks can still be run in parallel (at the same time). Therefore my 3 pipelines will also be run in parallel if I pass -p 1
.
And as a follow up question: Is there any way to figure out which tasks were mapped to which task slot so that I could confirm the parallel execution myself?
I would appreciate any input!
Update
Here is flink's execution plan for -p 2
.