0 votes

The input stream consists of data formatted as a JSON array of objects. Each object has one field/key named State by which we need to separate the input stream; see the example below:

Object1 -> "State":"Active"

Object2 -> "State":"Idle"

Object3 -> "State":"Blocked"

Object4 -> "State":"Active"

We have to start a processing thread as soon as we receive a particular state. As data keeps arriving, if a new object's state matches the previous one, the existing thread should handle it; otherwise, a new thread should be started for the new state. Also, each thread is required to run for a finite time, and all the threads should run in parallel.
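To make the intended grouping concrete, here is a small plain-Python illustration (not Flink code) of how the sample objects would split into runs of consecutive identical states, each run corresponding to one processing unit:

```python
from itertools import groupby

# Sample input: parsed JSON objects, in arrival order.
events = [
    {"State": "Active"},
    {"State": "Idle"},
    {"State": "Blocked"},
    {"State": "Active"},
]

# Consecutive objects sharing a State form one "run"; each run would be
# handled by one processing unit. Note the final "Active" starts a new
# run because it is not adjacent to the first one.
runs = [(state, list(group)) for state, group in groupby(events, key=lambda e: e["State"])]
```

With the four sample objects this yields four runs, since no two adjacent objects share a state.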

Please suggest how I can do this in Apache Flink. Pseudocode and links would be helpful.

Is your processing logic different per state? And is the value of State limited to the four mentioned, or can there be multiple states dynamically? – shriyog
@narush The processing logic is the same, and the states are finite for now, but the number can increase in the future. – Kspace

2 Answers

0 votes

This can be done with Flink's DataStream API. Each JSON object can be treated as a tuple, which can be processed with any of Flink's operators.

               /----- * *  | Active
------ (KeyBy) ------ *    | Idle
               \----- *    | Blocked

Now, you can split the single data stream into multiple streams using the KeyBy operator. This operator groups all the tuples with a particular key (State, in your case) into a keyed stream, which is processed in parallel. Internally, this is implemented with hash partitioning.

Any new keys (states) are handled dynamically, as new keyed streams are created for them.
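The routing property behind KeyBy can be simulated in plain Python (this is only an illustration, not Flink's actual implementation; Flink hashes keys into key groups internally, but the guarantee is the same: equal keys always reach the same parallel subtask, and unseen keys need no pre-registration):

```python
from zlib import crc32

PARALLELISM = 3  # illustrative number of parallel subtasks

def subtask_for(key: str, parallelism: int = PARALLELISM) -> int:
    # Stand-in for hash partitioning: a deterministic hash of the key,
    # modulo the parallelism, so the same key always routes to the
    # same subtask.
    return crc32(key.encode()) % parallelism

# Route a stream of states; "Shutdown" is a key never seen before and
# is handled without any special casing.
streams = {}
for state in ["Active", "Idle", "Blocked", "Active", "Shutdown"]:
    streams.setdefault(state, subtask_for(state))
```

The dictionary ends up with one entry per distinct key, each pinned to a fixed subtask, which mirrors how a keyed stream partitions work.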

Explore the documentation for the implementation details.

0 votes

From your description, I believe you'd first need an operator with a parallelism of 1 that "chunks" events by state and adds a "chunk id" to each output record. Whenever you get an event with a new state, you'd increment the chunk id.
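A minimal sketch of that chunking step, written as plain Python rather than a Flink operator (the field name `chunk_id` and the function name are illustrative):

```python
def assign_chunk_ids(events):
    # Meant to run at parallelism 1: the chunk id is incremented every
    # time an event's State differs from the previous event's State, so
    # each run of identical states shares one chunk id.
    chunk_id = 0
    prev_state = None
    for event in events:
        if prev_state is not None and event["State"] != prev_state:
            chunk_id += 1
        prev_state = event["State"]
        yield {**event, "chunk_id": chunk_id}

chunked = list(assign_chunk_ids([
    {"State": "Active"}, {"State": "Active"},
    {"State": "Idle"}, {"State": "Active"},
]))
```

Note that the second "Active" run gets a new chunk id (2), so it is processed separately from the first run, which matches the behavior asked for in the question.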

Then key by the chunk id, which will parallelize downstream processing. Add a custom function that is keyed by the chunk id and has a window duration of 10 minutes. This is where the bulk of your data processing will occur.
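To show what keying by the chunk id buys you, here is a plain-Python stand-in for the grouping (the 10-minute window duration and any processing inside it are omitted; this only illustrates how records with the same chunk id are collected together so separate chunks can be processed in parallel):

```python
from collections import defaultdict

def key_by_chunk(chunked_events):
    # Equivalent in spirit to keyBy(chunk_id): all records carrying the
    # same chunk id are gathered together, and each group could then be
    # handled by a separate parallel task / window.
    groups = defaultdict(list)
    for event in chunked_events:
        groups[event["chunk_id"]].append(event)
    return dict(groups)

groups = key_by_chunk([
    {"State": "Active", "chunk_id": 0},
    {"State": "Active", "chunk_id": 0},
    {"State": "Idle", "chunk_id": 1},
])
```

In Flink the window function would then fire per key once the window's duration elapses, which gives you the "run each unit for a finite time" requirement.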

And as @narush noted above, you should read through the documentation he linked so that you understand how windows work in Flink.