Description:
Currently I am working on using Flink with an IOT setup. Essentially, devices are sending data such as (device_id, device_type, event_timestamp, etc) and I don't have any control over when the messages get sent. I then key the steam by device_id and device_type to preform aggregations. I would like to use event-time given that is ensures the timers which are set trigger in a deterministic nature given a failure. However, given that this isn't always a high throughput stream a window could be opened for a 10 minute aggregation period, but not have its next point come until approximately 40 minutes later. Although the calculation would aggregation would eventually be completed it would output my desired result extremely late.
So my work around for this is to create an additional external source that does nothing other than pump fake messages. By having these fake messages being pumped out in alignment with my 10 minute aggregation period, even if a device hadn't sent any data, the event time windows would have something to force the windows closed. The critical part here is to make it possible that all parallel instances / operators have access to this fake message because I need to close all the windows with this single fake message. I was thinking that Broadcast state might be the most appropriate way to accomplish this goal given: "Broadcast state is replicated across all parallel instances of a function, and might typically be used where you have two streams, a regular data stream alongside a control stream that serves rules, patterns, or other configuration messages." Quote Source
Questions:
- Is broadcast state the best method for ensuring all parallel instances (e.g. windows) receive my fake messages?
- Once the operators have access to this fake message via the broadcast state can this fake message then be used to advance the event time watermark?