0 votes

The input stream consists of data formatted as a JSON array of objects. Each object has one field/key named State by which we need to separate the input stream; see the example below:

Object1 -> "State":"Active"

Object2 -> "State":"Idle"

Object3 -> "State":"Blocked"

Object4 -> "State":"Active"

We have to start a processing thread as soon as we receive a particular state. As data keeps arriving, if a new object's state matches the previous one, the existing thread should handle it; otherwise, a new thread should be started for the new state. Also, each thread is required to run for a finite time, and all the threads should run in parallel.
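To make the intended grouping concrete, here is a small plain-Python illustration (not Flink code) of how the sample objects would split into runs of consecutive identical states, each run corresponding to one processing unit:

```python
from itertools import groupby

# Sample input: parsed JSON objects, in arrival order.
events = [
    {"State": "Active"},
    {"State": "Idle"},
    {"State": "Blocked"},
    {"State": "Active"},
]

# Consecutive objects sharing a State form one "run"; each run would be
# handled by one processing unit. Note the final "Active" starts a new
# run because it is not adjacent to the first one.
runs = [(state, list(group)) for state, group in groupby(events, key=lambda e: e["State"])]
```

With the four sample objects this yields four runs, since no two adjacent objects share a state.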

Please suggest how I can do this in Apache Flink. Pseudocode and links would be helpful.

Is your processing logic different per state? And is the value of State limited to the four mentioned, or can there be multiple states dynamically? – shriyog
@narush The processing logic is the same, and the states are finite for now, but the number can increase in the future. – Kspace

2 Answers

0 votes

This can be done with Flink's DataStream API. Each JSON object can be treated as a tuple, which can be processed with any of Flink's operators.

               /----- * *  | Active
------ (KeyBy) ------ *    | Idle
               \----- *    | Blocked

Now, you can split the single data stream into multiple streams using the KeyBy operator. This operator groups all the tuples with a particular key (State, in your case) into a keyed stream, which is processed in parallel. Internally, this is implemented with hash partitioning.

Any new keys (states) are handled dynamically, as new keyed streams are created for them.
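The routing property behind KeyBy can be simulated in plain Python (this is only an illustration, not Flink's actual implementation; Flink hashes keys into key groups internally, but the guarantee is the same: equal keys always reach the same parallel subtask, and unseen keys need no pre-registration):

```python
from zlib import crc32

PARALLELISM = 3  # illustrative number of parallel subtasks

def subtask_for(key: str, parallelism: int = PARALLELISM) -> int:
    # Stand-in for hash partitioning: a deterministic hash of the key,
    # modulo the parallelism, so the same key always routes to the
    # same subtask.
    return crc32(key.encode()) % parallelism

# Route a stream of states; "Shutdown" is a key never seen before and
# is handled without any special casing.
streams = {}
for state in ["Active", "Idle", "Blocked", "Active", "Shutdown"]:
    streams.setdefault(state, subtask_for(state))
```

The dictionary ends up with one entry per distinct key, each pinned to a fixed subtask, which mirrors how a keyed stream partitions work.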

Explore the documentation for the implementation details.

0 votes

From your description, I believe you'd first need an operator with a parallelism of 1 that "chunks" events by state and adds a "chunk id" to each output record. Whenever you get an event with a new state, you'd increment the chunk id.
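A minimal sketch of that chunking step, written as plain Python rather than a Flink operator (the field name `chunk_id` and the function name are illustrative):

```python
def assign_chunk_ids(events):
    # Meant to run at parallelism 1: the chunk id is incremented every
    # time an event's State differs from the previous event's State, so
    # each run of identical states shares one chunk id.
    chunk_id = 0
    prev_state = None
    for event in events:
        if prev_state is not None and event["State"] != prev_state:
            chunk_id += 1
        prev_state = event["State"]
        yield {**event, "chunk_id": chunk_id}

chunked = list(assign_chunk_ids([
    {"State": "Active"}, {"State": "Active"},
    {"State": "Idle"}, {"State": "Active"},
]))
```

Note that the second "Active" run gets a new chunk id (2), so it is processed separately from the first run, which matches the behavior asked for in the question.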

Then key by the chunk id, which will parallelize downstream processing. Add a custom function that is keyed by the chunk id and has a window duration of 10 minutes. This is where the bulk of your data processing will occur.
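To show what keying by the chunk id buys you, here is a plain-Python stand-in for the grouping (the 10-minute window duration and any processing inside it are omitted; this only illustrates how records with the same chunk id are collected together so separate chunks can be processed in parallel):

```python
from collections import defaultdict

def key_by_chunk(chunked_events):
    # Equivalent in spirit to keyBy(chunk_id): all records carrying the
    # same chunk id are gathered together, and each group could then be
    # handled by a separate parallel task / window.
    groups = defaultdict(list)
    for event in chunked_events:
        groups[event["chunk_id"]].append(event)
    return dict(groups)

groups = key_by_chunk([
    {"State": "Active", "chunk_id": 0},
    {"State": "Active", "chunk_id": 0},
    {"State": "Idle", "chunk_id": 1},
])
```

In Flink the window function would then fire per key once the window's duration elapses, which gives you the "run each unit for a finite time" requirement.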

And as @narush noted above, you should read through the documentation he linked so that you understand how windows work in Flink.