I would like to use Spark Structured Streaming for an ETL job where each event has the form:
{
"signature": "uuid",
"timestamp: "2020-01-01 00:00:00",
"payload": {...}
}
The events can arrive up to 30 days late and can include duplicates. I would like to deduplicate them based on the "signature" field.
If I use the recommended solution:
streamingDf \
    .withWatermark("timestamp", "30 days") \
    .dropDuplicates(["signature", "timestamp"]) \
    .writeStream
would that track (keep in memory, store in the state store, etc.) a buffer of the full event content (which can be quite large), or would it track just the "signature" field values?
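For reference, this is the kind of minimal local test I was going to run to measure the state size myself; the rate source and memory sink are just stand-ins, not my real pipeline:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dedup-state-check").getOrCreate()

# The built-in "rate" source already produces a "timestamp" column;
# "signature" is derived from the counter just to have a dedup key.
testDf = (
    spark.readStream.format("rate").option("rowsPerSecond", 100).load()
    .withColumn("signature", col("value").cast("string"))
)

query = (
    testDf
    .withWatermark("timestamp", "30 days")
    .dropDuplicates(["signature", "timestamp"])
    .writeStream
    .format("memory")
    .queryName("dedup_test")
    .outputMode("append")
    .start()
)

# After a few micro-batches, lastProgress reports the deduplication state:
# number of rows kept and approximate memory used.
progress = query.lastProgress
if progress:
    for op in progress["stateOperators"]:
        print(op["numRowsTotal"], op["memoryUsedBytes"])

Measuring only gives me totals for the toy data, though; I would still like to know whether the state holds full rows or only the deduplication key columns.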
Also, would a simple query like the one above write new events immediately as new data arrives, or would it "block" for 30 days?
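For context, the full query I have in mind looks roughly like the sketch below; the Kafka source, Parquet sink, topic, paths, and trigger interval are all placeholders rather than my real configuration, and the payload is kept as a raw string just for the example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("dedup-etl").getOrCreate()

# Placeholder schema matching the event shape above.
schema = (
    StructType()
    .add("signature", StringType())
    .add("timestamp", TimestampType())
    .add("payload", StringType())
)

streamingDf = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("subscribe", "events")                        # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    streamingDf
    .withWatermark("timestamp", "30 days")
    .dropDuplicates(["signature", "timestamp"])
    .writeStream
    .format("parquet")                                    # placeholder sink
    .option("path", "/tmp/out")                           # placeholder path
    .option("checkpointLocation", "/tmp/checkpoint")      # required for streaming sinks
    .outputMode("append")
    .trigger(processingTime="1 minute")                   # placeholder trigger
    .start()
)

In particular, I'm unsure whether outputMode("append") combined with the 30-day watermark would delay the rows this query writes.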