3 votes

I have a few questions regarding the parallelism of flink. This is my setup:

I have 1 master node and 2 slaves. In Flink I have created 3 Kafka consumers, each of which consumes from a different topic.
Since the order of the elements is important to me, each topic has only one partition, and I have set up Flink to use event time.

Then I run the following pipeline (in pseudo code) on each of the data streams:

source
.map(deserialize)
.window
.apply
.map(serialize)
.writeTo(sink)
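
For reference, one of these pipelines looks roughly like the following sketch in the Scala API (the topic name, schemas, window size, window function and sink are placeholders for my actual code; I am using the universal Kafka connector here, older Flink versions use the versioned FlinkKafkaConsumer0x classes instead):

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object SingleTopicPipeline {

  def deserialize(raw: String): String = raw   // placeholder for the real deserializer
  def serialize(event: String): String = event // placeholder for the real serializer

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "example-group")

    // One consumer per topic; each topic has exactly one partition.
    val source = env.addSource(
      new FlinkKafkaConsumer[String]("topic-a", new SimpleStringSchema(), props))

    source
      .map(deserialize _)
      // timestamp/watermark assignment omitted for brevity
      .timeWindowAll(Time.seconds(10))  // stands in for .window in the pseudo code
      .reduce((a, b) => a + " " + b)    // stands in for .apply in the pseudo code
      .map(serialize _)
      .print()                          // placeholder sink

    env.execute("single-topic pipeline")
  }
}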

Up until now I have started my Flink program with the argument -p 2, assuming that this would allow me to use both of my nodes. The result is not what I was hoping for, since the order of my output is sometimes messed up.

After reading through the Flink documentation and trying to understand it better, could someone please confirm the following "learnings" of mine?

1.) Passing -p 2 configures the task parallelism only, i.e. the maximum number of parallel instances a task (such as map(deserialize)) will be split into. If I want to keep the order through the whole pipeline I have to use -p 1 (a small sketch of how the parallelism can also be set in code follows after point 2).

2.) This seems contradictory/confusing to me: even if the parallelism is set to 1, different tasks can still run in parallel (at the same time). Therefore my 3 pipelines will also run in parallel if I pass -p 1.
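
For completeness, this is how I understand point 1 in code; the job-wide setting is what -p controls, while an individual operator may override it (the values here are arbitrary):

import org.apache.flink.streaming.api.scala._

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Job-wide default parallelism, equivalent to passing -p 1 on the CLI.
    env.setParallelism(1)

    env.fromElements("a", "b", "c")
      .map(_.toUpperCase)
      .setParallelism(2) // a single operator can still override the job default
      .print()

    env.execute("parallelism sketch")
  }
}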

And as a follow-up question: Is there any way to figure out which tasks were mapped to which task slot so that I could confirm the parallel execution myself?

I would appreciate any input!

Update

Here is Flink's execution plan for -p 2.


3 Answers

5 votes

After asking the question on the Apache Flink user mailing list, here is the answer:

1.) The -p option defines the task parallelism per job. If the parallelism is chosen higher than 1 and data gets redistributed (e.g. via rebalance() or keyBy()), the order is not guaranteed.

2.) With -p set to 1, only 1 task slot, i.e. 1 CPU core, is used. Therefore there might be multiple threads running on one core concurrently, but not in parallel.

As for my requirements: in order to run multiple pipelines in parallel and still keep the order, I can simply run multiple Flink jobs instead of running all pipelines within the same Flink job.
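
In practice that means submitting each pipeline as a separate job with a parallelism of 1, for example (the jar and class names below are just placeholders):

./bin/flink run -p 1 -c com.example.TopicAJob pipelines.jar
./bin/flink run -p 1 -c com.example.TopicBJob pipelines.jar
./bin/flink run -p 1 -c com.example.TopicCJob pipelines.jar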

0 votes

I'll try to answer with what I know.

1) Yes, with the CLI client the parallelism parameter can be specified with -p. You are right to say this is the maximum number of parallel instances. However, I don't see the link between the parallelism and the order. As far as I know, the order is managed by Flink with the timestamp provided in the event or with its own ingestion timestamp. If you want to maintain order across different data sources, it seems complicated to me; you might merge these different data sources into one.

2) Your 3 pipelines can run in parallel if you have the parallelism set to 3. I think parallelism here means running on different slots.

Follow-up question) You can check which tasks are mapped to which task slot on the JobManager's web frontend at http://localhost:8081.

0 votes

Please find below an example of scaling locally using side outputs and slot sharing groups.

package org.example

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
  * This example shows an implementation of WordCount with data from a text socket.
  * To run the example make sure that the service providing the text data is already up and running.
  *
  * To start an example socket text stream on your local machine run netcat from a command line,
  * where the parameter specifies the port number:
  *
  * {{{
  *   nc -lk 9999
  * }}}
  *
  * Usage:
  * {{{
  *   SocketTextStreamWordCount <hostname> <port>
  * }}}
  *
  * This example shows how to:
  *
  *   - use StreamExecutionEnvironment.socketTextStream
  *   - write a simple Flink Streaming program in scala.
  *   - write and use user-defined functions.
  */
object SocketTextStreamWordCount {

  def main(args: Array[String]) {
    if (args.length != 2) {
      System.err.println("USAGE:\nSocketTextStreamWordCount <hostname> <port>")
      return
    }

    val hostName = args(0)
    val port = args(1).toInt
    val outputTag1 = OutputTag[String]("side-1")
    val outputTag2 = OutputTag[String]("side-2")

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.getConfig.enableObjectReuse()

    // Read the text stream from the socket; the source is placed in its own slot sharing group
    val text = env.socketTextStream(hostName, port).slotSharingGroup("processElement")
    // Split each line into non-empty lowercase words
    val counts = text.flatMap {
      _.toLowerCase.split("\\W+") filter {
        _.nonEmpty
      }
    }
      // Route every word to one of the two side outputs; the main output stays empty
      .process(new ProcessFunction[String, String] {
        override def processElement(
                                     value: String,
                                     ctx: ProcessFunction[String, String]#Context,
                                     out: Collector[String]): Unit = {
          if (value.head <= 'm') ctx.output(outputTag1, value)
          else ctx.output(outputTag2, value)
        }
      })

    val sideOutputStream1: DataStream[String] = counts.getSideOutput(outputTag1)
    val sideOutputStream2: DataStream[String] = counts.getSideOutput(outputTag2)

    // Count words from each side output; separate slot sharing groups allow the
    // two branches to run in different task slots
    val output1 = sideOutputStream1.map {
      (_, 1)
    }.slotSharingGroup("map1")
      .keyBy(0)
      .sum(1)

    val output2 = sideOutputStream2.map {
      (_, 1)
    }.slotSharingGroup("map2")
      .keyBy(0)
      .sum(1)

    output1.print()
    output2.print()

    env.execute("Scala SocketTextStreamWordCount Example")
  }

}
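
As far as I understand slot sharing, operators in different slot sharing groups cannot share a task slot, so the three groups above ("processElement", "map1" and "map2") spread the job over at least three slots; that is what lets the two counting branches run in parallel even when executing locally.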