I have data coming in on Kafka from IoT devices. The timestamps of the sensor data of these devices are often not in sync due to network congestion, device being out of range, etc.
We have to write streaming jobs to aggregate sensor values over a window of time for each device independently. With the groupby with watermark operation, we lose the data of all devices that lag behind device with latest timestamp.
Is there any way that the watermarks could be applied separately to each device based on the latest timestamp for that device, and not the latest timestamp across all devices?
We cannot keep a large lag as the device could be out of range for days. We cannot run an individual query for each device as the number of devices is high.
Would it be achievable using flatMapGroupsWithState? Or is this something that cannot be achieved with Spark Structured Streaming at all?