I'm trying to drop late data from Structured Streaming dataset.
Using spark's withWatermark function doesn't help and the late data isn't drop.
My dataset is without aggregation on the event time column, so this is probably the reason, according to sparks internals, watermark is used for state management, but I want to use it for dropping the late data.
Is there any other way to force spark honor the watermark?
In the logs I see that the watermark was applied (i'm sending data before to update the watermark):
"eventTime" : {
"avg" : "2020-04-08T14:10:01.532Z",
"max" : "2020-04-12T02:10:01.532Z",
"min" : "2020-04-05T02:10:01.532Z",
"watermark" : "2020-04-09T02:00:01.532Z"
}
but the old events are still written to the results.