2 votes

I'm trying to drop late data from Structured Streaming dataset.

Using Spark's withWatermark function doesn't help, and the late data isn't dropped.

My dataset has no aggregation on the event-time column, which is probably the reason: according to Spark's internals, the watermark is only used for state management, but I want to use it to drop the late data.
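A minimal sketch of the kind of query I mean (the source, schema, and checkpoint path below are hypothetical, just to illustrate the shape of the query):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("late-data-repro").getOrCreate()

// Hypothetical JSON payload with an event-time column called "eventTime".
val schema = new StructType()
  .add("id", StringType)
  .add("eventTime", TimestampType)

val events = spark.readStream
  .format("socket")                 // hypothetical source, for illustration only
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .select(from_json(col("value"), schema).as("data"))
  .select("data.*")

// The watermark is declared, but there is no aggregation (or other stateful
// operator) after it, so Spark only reports it in the progress metrics;
// rows older than the watermark still reach the sink.
events
  .withWatermark("eventTime", "24 hours")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/late-data-checkpoint")
  .start()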

Is there any other way to force Spark to honor the watermark?

In the logs I see that the watermark was applied (I'm sending data beforehand to advance the watermark):

"eventTime" : {
"avg" : "2020-04-08T14:10:01.532Z",
"max" : "2020-04-12T02:10:01.532Z",
"min" : "2020-04-05T02:10:01.532Z",
"watermark" : "2020-04-09T02:00:01.532Z"
}

but the old events are still written to the results.

1
No, there is not. You need to think outside the box; a few letdowns here and there. – thebluephantom

1 Answer

1 vote

I ran into the same problem. The official documentation says it is not guaranteed that Spark will drop old data: the watermark guarantee is strict in only one direction, so records delayed by more than the threshold may or may not get dropped (see "Semantic Guarantees of Aggregation with Watermarking" in the Structured Streaming programming guide).
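Since dropping is not guaranteed, one workaround is to drop late rows yourself with an explicit filter on the event-time column. A rough sketch, assuming your streaming DataFrame is called events and has an eventTime column (the 24-hour threshold is illustrative); note that the filter compares against processing time, so it only approximates the watermark:

import org.apache.spark.sql.functions._

// events is the streaming DataFrame from the question (hypothetical name).
// Keep the watermark for any downstream stateful operators, but also drop
// late rows explicitly so they never reach the sink.
val cutoff = expr("current_timestamp() - INTERVAL 24 HOURS")

val withoutLateRows = events
  .withWatermark("eventTime", "24 hours")
  .filter(col("eventTime") >= cutoff)

Alternatively, adding a stateful operator after the watermark (for example an aggregation over a window of eventTime) makes the engine actually apply the watermark for state cleanup, though, as the documentation notes, dropping very late records is still not strictly guaranteed.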