Flink reference data advice/best practice

Question

Looking for some advice on where to store/access Flink reference data. Use case here is really simple - I have a single column text file with a list of countries. I am streaming twitter data and then matching the countries from the text file based on the (parsed) Location field of the tweet. In the IDE (Eclipse) its all good as I have a static ArrayList populated when the routine fires up via a static Build method in my Flink Mapper (ie implements Flinks MapFunction). This class is now inner static as it gets shirt on serialization otherwise. Point is, when the overridden map function is invoked at runtime from within the stream, the static array of country data is their waiting, fully populated and ready to be matched against. Works a charm. BUT, when deployed into a Flink cluster ( and it took me to hell and back last week to actually get the code to FIND the text file), the array is only populated as part of the Build method. When it comes to being used the data has mysteriously disappeared and I am left with an array size of 0. (ergo, not a lot of matches get found. Thus, 2 questions - why does it work in Eclipse and not on deploy (renders a lot of Eclipse unit tests pointless as well). Or possibly just more generally, what is the right way to cross reference this kind of static, fixed reference data within Flink? (and in a way that it is found in both Eclipse and the cluster...)

David Anderson David Anderson · Accepted Answer · 2017-12-16T09:12:34

The standard way to handle static reference data is to load the data in the open method of a RichMapFunction or RichFlatMapFunction. Rich functions have open and close methods that are useful for creating and finalizing local state, and can access the runtime context.

Flink reference data advice/best practice

1 Answers