I need to enrich my fast-changing streamA, keyed by (userId, startTripTimestamp), with a slowly changing streamB, keyed by (userId). I use Flink 1.8 with the DataStream API. I am considering two approaches:
1. Broadcast streamB and join the streams by userId and the most recent timestamp. Would that be equivalent to a dynamic table from the Table API? I can see a downside of this solution: streamB needs to fit into the RAM of each worker node, and it increases overall RAM utilization, because the whole of streamB is stored in the RAM of every worker.

2. Generalise the state of streamA to a stream keyed by just (userId), let's name it streamC, so that it has a common key with streamB. Then I am able to union streamC with streamB, order by processing time, and handle both types of events in a single stateful function. Handling the generalised stream is more complex (more code in the process function), but it does not consume as much RAM, since streamB is not replicated to all nodes. Are there any more downsides or upsides of this solution?
I have also seen this proposal https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API where it is said:
In general, most of these follow the pattern of joining a main stream of high throughput with one or several inputs of slowly changing or static data:
[...]
Join stream with slowly evolving data: This is very similar to the above case but the side input that we use for enriching is evolving over time. This can be done by waiting for some initial data to be available before processing the main input and the continuously ingesting new data into the internal side input structure as it arrives.
Unfortunately, it looks like this feature is still a long way off (https://issues.apache.org/jira/browse/FLINK-6131), and no alternatives are described there. Therefore I would like to ask what the currently recommended approach is for the described use case.
I've seen Combining low-latency streams with multiple meta-data streams in Flink (enrichment), but it does not specify the keys of those streams, and moreover it was answered at the time of Flink 1.4, so I expect the recommended solution might have changed.
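For completeness, here is an equally minimal Flink-free sketch of what the broadcast approach (the first one above) amounts to on each parallel worker: the full userId-to-metadata map from streamB is held locally, so enriching a streamA event is just a map lookup. Names are again hypothetical; in real Flink code this would be a BroadcastProcessFunction with the map kept in broadcast state:

```java
import java.util.HashMap;
import java.util.Map;

// One parallel worker in the broadcast variant: the full streamB map is
// replicated here, which is exactly the RAM cost discussed above.
class BroadcastWorker {
    private final Map<String, String> metadataByUser = new HashMap<>();

    // Broadcast side: every worker applies every streamB update,
    // keeping only the most recent metadata per user.
    void onBroadcast(String userId, String metadata) {
        metadataByUser.put(userId, metadata);
    }

    // Keyed side: enrich a streamA event with the most recent metadata,
    // or return null if none has arrived yet for this user.
    String enrich(String userId, String trip) {
        String metadata = metadataByUser.get(userId);
        return metadata == null ? null : trip + " enriched with " + metadata;
    }
}
```

The contrast with the union sketch is that here no buffering per key is needed, at the price of every worker storing all of streamB.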