In my application, I want to enrich an infinite stream of events. The stream itself is parallelised by hashing of an Id. For every event, there might be a call to an external source (e.g. REST, DB). This call is blocking by nature. The order of events within one stream partition must be maintained.

My idea was to create a RichMapFunction, which sets up the connection and then queries the external source for each event. The blocking call usually does not take too long, but in the worst case, the service could be down.

Theoretically, this works, but I don't feel good doing it this way, as I don't know how Flink reacts to blocking operations within the stream. What happens if a lot of parallel streams block, i.e. will I run out of threads? And how does the stream behave upstream, at the point where it is parallelised?

Has anyone else had a similar issue and can answer my question, or has some ideas on how to tackle it?

You will not run out of threads. If your REST call blocks, the operator is not able to make progress. Thus, back pressure will apply, eventually slowing down your source operators. - Matthias J. Sax
Thanks for the answer. My concern was more about whether it's the right approach for this kind of data enrichment. - peterschrott
Hard to answer. I would assume that you pay a throughput penalty, and Flink was never designed for such a use case. Flink allows for high throughput because it decouples processing as much as possible, which is contradicted if you block. But I doubt that you will "break" the system by it. - Matthias J. Sax
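
To make the back pressure point from the comments more concrete, here is a minimal sketch in plain Java. It does not use Flink and is not how Flink implements back pressure internally; it is only an analogy with a bounded buffer between a fast producer (the "source") and a slow consumer (the "operator" doing a blocking call). The class name `BackPressureSketch`, the method `maxProducerLead`, and the sleep standing in for the external call are all made up for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class BackPressureSketch {

    // Runs a fast producer against a slow, blocking consumer connected by a
    // bounded buffer. Returns the largest lead (events produced minus events
    // fully processed) that the producer ever reached.
    static int maxProducerLead(int events, int capacity, long blockingMillis) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        AtomicInteger consumed = new AtomicInteger();

        // The "operator": takes one event at a time and blocks on each.
        Thread operator = new Thread(() -> {
            try {
                for (int i = 0; i < events; i++) {
                    buffer.take();
                    Thread.sleep(blockingMillis); // stand-in for the REST/DB call
                    consumed.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        operator.start();

        int maxLead = 0;
        try {
            // The "source": would run at full speed, but put() blocks
            // whenever the buffer is full, so it is slowed down instead.
            for (int produced = 1; produced <= events; produced++) {
                buffer.put(produced);
                maxLead = Math.max(maxLead, produced - consumed.get());
            }
            operator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return maxLead;
    }

    public static void main(String[] args) {
        System.out.println("max producer lead: " + maxProducerLead(20, 2, 5));
    }
}
```

The producer can never get further ahead than the buffer capacity plus the one event currently being processed, no matter how slow the blocking call is. Nothing breaks and no extra threads are created; throughput simply drops to the speed of the slowest stage.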

1 Answer

RichMapFunction is a good starting point, but prefer RichAsyncFunction, which is asynchronous and does not block your processing!

Careful:
1- your DB access must also be asynchronous
2- your event order may change (depending on the mode used)
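
To illustrate the second point, here is a minimal sketch in plain Java of what an order-preserving mode does: all lookups are started concurrently, but results are emitted in input order. It deliberately does not use Flink's API (in Flink the two modes are selected with `AsyncDataStream.orderedWait` and `AsyncDataStream.unorderedWait`). The class `OrderedAsyncSketch`, the method `blockingLookup`, and the `":enriched"` suffix are hypothetical. Note that it wraps a blocking call in a thread pool, which is only an assumption for the demo; a truly asynchronous client, as recommended in the first point, would avoid tying up threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OrderedAsyncSketch {

    // Hypothetical stand-in for a REST or DB lookup. Longer events take
    // longer, so later events in the demo input finish first.
    static String blockingLookup(String event) {
        try {
            Thread.sleep(10L * event.length());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return event + ":enriched";
    }

    // Starts all lookups concurrently, then collects the results in the
    // order the events arrived, regardless of completion order.
    static List<String> enrichInOrder(List<String> events) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (String event : events) {
                pending.add(CompletableFuture.supplyAsync(
                        () -> blockingLookup(event), pool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : pending) {
                results.add(future.join()); // waits for this event's result
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(enrichInOrder(Arrays.asList("cccc", "bb", "a")));
    }
}
```

The trade-off is that a slow lookup at the head holds back results that are already finished behind it. An unordered mode would emit each result as soon as it completes, which lowers latency but gives up the per-partition ordering the question asks for.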

More details : https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html

Hope it helps