A good question for Spark experts.
I am processing data in a map operation over an RDD. Within the mapper function, I need to look up objects of class A in order to process each element of the RDD. Since the mapping runs on the executors, and creating instances of A (the objects being looked up) is expensive, I want to pre-load and cache these objects on each executor. What is the best way to do this?
One idea is to broadcast a lookup table, but class A is not serializable (I have no control over its implementation). Another idea is to load the objects into a singleton object. However, I want to control what gets loaded into that lookup table (e.g., possibly different data for different Spark jobs).
Ideally, I want to specify once what will be loaded on the executors (including in the Streaming case, so that the lookup table stays in memory between batches), via a parameter available to the driver at start-up, before any data is processed.
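To make the singleton idea concrete, here is a minimal sketch in plain Python of the pattern I have in mind (the same shape would work as a Scala `object` holding a lazily initialized map). Names like `ExpensiveA`, `LookupCache`, and `mapper` are hypothetical, not from any real API; the point is that the non-serializable A instances are built lazily inside each executor process from a small, serializable config value that *can* be broadcast or passed in a closure.

```python
class ExpensiveA:
    """Stand-in for the non-serializable class A (hypothetical)."""
    def __init__(self, key):
        # Imagine an expensive, non-serializable construction here.
        self.key = key


class LookupCache:
    """Process-level singleton: built once per executor process,
    then reused by every task (and, in Streaming, every batch)."""
    _table = None
    _config = None

    @classmethod
    def get(cls, config):
        # `config` is a small, serializable value (e.g. a broadcast
        # string) that tells the executor WHAT to load; the expensive
        # A objects themselves are created locally, never shipped.
        if cls._table is None or cls._config != config:
            cls._config = config
            cls._table = {k: ExpensiveA(k) for k in config.split(",")}
        return cls._table


def mapper(element, config):
    # Each task asks the singleton for the table; only the first call
    # on each executor pays the construction cost.
    table = LookupCache.get(config)
    return (element, table[element].key)
```

In Spark, `config` would come from a broadcast variable or a driver-side setting captured in the closure, and `mapper` would be called from `rdd.map` (or `mapPartitions`, which also amortizes the singleton lookup over a whole partition).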
Is there a clean and elegant way to do this, or is it impossible to achieve?