
My question is about iterating over multiple streams in Apache Flink.

I am a Flink beginner, and I am currently trying to execute a recursive query (e.g., Datalog) on Flink.

For example, a query computes the transitive closure every 5 minutes (tumbling window). I have one input stream, inputStream (the initial edge information), and another stream, outputStream (the transitive closure), which is initialised from inputStream. I want to iteratively enrich outputStream by joining it with inputStream. In each iteration, the feedback should be outputStream, and the iteration should continue until no more edges can be appended to outputStream. The computation of the transitive closure should be triggered periodically, every 5 minutes. During the iteration, inputStream should be "held" and provide the data for outputStream.
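To make the per-window computation concrete, here is a minimal plain-Java sketch of the fixpoint described above. It uses no Flink API at all; the class name `TransitiveClosureSketch` and the representation of an edge as a two-element list are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Flink: computes the closure for one window's edges.
final class TransitiveClosureSketch {

    /** Returns all pairs (a, b) such that b is reachable from a via one or more edges. */
    static Set<List<String>> closure(Set<List<String>> edges) {
        // Index the fixed input edges by source vertex: this plays the "held" inputStream.
        Map<String, Set<String>> successors = new HashMap<>();
        for (List<String> e : edges) {
            successors.computeIfAbsent(e.get(0), k -> new HashSet<>()).add(e.get(1));
        }

        // The result plays outputStream; it is initialised from the input edges.
        Set<List<String>> result = new HashSet<>(edges);
        // Only pairs found in the previous round are fed back into the next join.
        Set<List<String>> delta = new HashSet<>(edges);

        while (!delta.isEmpty()) {
            Set<List<String>> next = new HashSet<>();
            for (List<String> pair : delta) {
                // Join (a, b) with every input edge (b, c) to derive (a, c).
                for (String c : successors.getOrDefault(pair.get(1), Set.of())) {
                    List<String> candidate = List.of(pair.get(0), c);
                    if (result.add(candidate)) {
                        next.add(candidate); // genuinely new, so feed it back
                    }
                }
            }
            delta = next; // an empty delta means no more edges can be appended
        }
        return result;
    }
}
```

In this sketch the loop terminates on cyclic graphs as well, because a pair is fed back only if it was not already in the result.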

Is it possible to do this in Flink? Thanks for any help!


1 Answer


This sounds like a side-input issue, where you want to treat the "inputStream" as a batch dataset (with refresh) that's joined to the other "outputStream". Unfortunately, Flink doesn't currently provide an easy way to implement that (see https://stackoverflow.com/a/48701829/231762).

If both of these streams are coming from data sources, then one approach is to create a wrapper source that controls the ordering of the records. It would have to emit something like a Tuple2 where one side or the other is null, and then in a downstream (custom) Function you'd essentially split these, and do the joining.

If that's possible, then this source can block the "output" tuples while it emits the "input" tuples, plus other logic it sounds like you need (5 minute refresh, etc). See my response to the other SO issue above for skeleton code that does this.
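As a rough illustration of the idea (not the skeleton code from the linked answer), here is a plain-Java sketch with no Flink dependency. Every name in it is hypothetical: `Tagged` stands in for a Tuple2 with one null side, `WrapperSourceSketch` for the ordering source, and `SplitAndJoin` for the downstream custom function. In a real job these would presumably become a custom source and a stateful function, and the 5 minute refresh is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Tuple2 in which exactly one side is non-null. */
final class Tagged {
    final List<String> input;   // an edge from "inputStream", or null
    final List<String> output;  // a pair from "outputStream", or null

    private Tagged(List<String> input, List<String> output) {
        this.input = input;
        this.output = output;
    }

    static Tagged ofInput(List<String> edge) { return new Tagged(edge, null); }
    static Tagged ofOutput(List<String> pair) { return new Tagged(null, pair); }
}

/** Assumed ordering logic: hold back "output" records until all "input" records are out. */
final class WrapperSourceSketch {
    static List<Tagged> emitOrdered(List<List<String>> inputs, List<List<String>> outputs) {
        List<Tagged> emitted = new ArrayList<>();
        for (List<String> edge : inputs) {
            emitted.add(Tagged.ofInput(edge));
        }
        for (List<String> pair : outputs) {
            emitted.add(Tagged.ofOutput(pair));
        }
        return emitted;
    }
}

/** Plain-Java analogue of the downstream function: split by side, then join. */
final class SplitAndJoin {
    private final List<List<String>> heldEdges = new ArrayList<>();

    /** Processes one wrapped record and returns the pairs it produces, if any. */
    List<List<String>> process(Tagged record) {
        List<List<String>> joined = new ArrayList<>();
        if (record.input != null) {
            heldEdges.add(record.input); // "input" side: buffer only
        } else {
            // "output" side: join (a, b) against every held edge (b, c).
            for (List<String> edge : heldEdges) {
                if (record.output.get(1).equals(edge.get(0))) {
                    joined.add(List.of(record.output.get(0), edge.get(1)));
                }
            }
        }
        return joined;
    }
}
```

Because the source emits the "input" side first, the join function can assume its buffer is complete by the time the first "output" record arrives, which is the point of controlling the ordering in the source.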