3 votes

This question was already asked here, but since it's been two years, I'm wondering whether anything has changed.

I have a use-case in which I would like to share state between two Flink operators:

[Figure: desired stream diagram]

  • Stream A is the main stream; it flows continuously.
  • Stream B is just a dataset of enrichment data. It's big (several GB), so it will not fit as a broadcast stream.
  • Stream B has an operator associated with it (a FlatMap, but it could be anything really) which acts as a state-loader and loads the enrichment data into RocksDB as list state.
  • Then I connect the streams, and in the connected operator I would like to have access to the same state that was created in the enrichment stream.

Lastly, I know I can simply load the entire state AFTER the streams have been connected, using a "co" function. It's just that, from a software engineering point of view, separating the responsibilities into a "state-loader" class and an actual "data-enricher" class seems cleaner, so I'd like to know whether it's possible.

Thank you.


1 Answer

2 votes
  1. Actually it's hard to "simply load the entire state", in that you can't control the ordering of the load. Normally you'd want to completely load the enrichment data before processing any of the main stream (see FLIP-23).
  2. Leaving that aside, I wouldn't view it as "state-loading". Basically you're caching the enrichment data where it's needed (in the enrichment function).
  3. And finally, no, I don't know of an easy, built-in way in Flink to share state between operators. You can obviously use some external key-value store to enable this, but (a) that's extra infrastructure, and (b) it's not going to be as performant.
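To make points 1 and 2 concrete, here is a minimal sketch of the "cache it where it's needed" idea, written as a plain-Java model rather than real Flink code. It assumes a keyed two-input function: the two maps stand in for keyed state, and the two methods stand in for the two inputs of a "co" function. The class and method names (`EnrichingJoin`, `onMain`, `onEnrichment`, `drainOutput`) are made up for illustration and are not part of any Flink API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a keyed two-input ("co") function; no Flink dependency.
// All names are hypothetical and chosen for illustration only.
class EnrichingJoin {
    // Stands in for keyed list state holding the enrichment records per key.
    private final Map<String, List<String>> enrichmentByKey = new HashMap<>();
    // Stands in for keyed list state buffering main events that arrived too early.
    private final Map<String, List<String>> pendingByKey = new HashMap<>();
    // Collected results; a real operator would emit these downstream instead.
    private final List<String> output = new ArrayList<>();

    // Input from stream B: cache the enrichment record, then flush any
    // main-stream events that were waiting for this key.
    void onEnrichment(String key, String value) {
        enrichmentByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        List<String> pending = pendingByKey.remove(key);
        if (pending != null) {
            for (String event : pending) {
                emit(key, event);
            }
        }
    }

    // Input from stream A: enrich immediately if data for the key is cached,
    // otherwise buffer the event until enrichment data shows up.
    void onMain(String key, String event) {
        if (enrichmentByKey.containsKey(key)) {
            emit(key, event);
        } else {
            pendingByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(event);
        }
    }

    private void emit(String key, String event) {
        output.add(event + " -> " + enrichmentByKey.get(key));
    }

    // Returns everything emitted so far and clears the output buffer.
    List<String> drainOutput() {
        List<String> result = new ArrayList<>(output);
        output.clear();
        return result;
    }
}
```

Note that this sketch treats the first enrichment record for a key as sufficient and has no notion of the load being "complete", which is exactly the ordering problem described in point 1. Both the loading and the enriching live in one class here because the cached data is only reachable from the function that owns it.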