
Say I have streaming data with the following schema:

uid: string
ts: timestamp

Now assume the data has been partitioned by uid (the volume in each partition is small, e.g. less than 1 row/sec).

I would like to put the data (in each partition) into windows based on the event time ts, then sort the elements within each window (also by ts), and finally apply a custom transformation to each element in the window, in order.

Q1: Is there any way to get an aggregated view of the window while keeping each element, e.g. materialize all the elements in a window into a list?

Q2: If Q1 is possible, I would like to set a watermark and trigger combination that triggers once at the end of the window, and then triggers either periodically or every time late data arrives. Is that possible?


1 Answer


Before I answer the questions, let me point out that Spark Structured Streaming offers KeyValueGroupedDataset.flatMapGroupsWithState (after Dataset.groupByKey) for arbitrary stateful streaming aggregation with explicit state logic, which gives you the most control for manual streaming state management.
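For reference, here is a minimal sketch of that approach. It assumes a streaming Dataset[Event] obtained from some source, a hypothetical 10-minute watermark, and buffers each uid's events in state until the watermark passes an event-time timeout:

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(uid: String, ts: Timestamp)

val spark = SparkSession.builder.appName("per-uid-buffering").getOrCreate()
import spark.implicits._

// Assumed: a streaming Dataset[Event] read from some source (Kafka, files, ...).
val events: Dataset[Event] = ???

val transformed = events
  .withWatermark("ts", "10 minutes")   // hypothetical lateness bound
  .groupByKey(_.uid)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.EventTimeTimeout) {
    (uid: String, batch: Iterator[Event], state: GroupState[Seq[Event]]) =>
      if (state.hasTimedOut) {
        // The watermark passed the timeout: emit the buffered events sorted by ts.
        // The custom per-element transformation would be applied here, in order.
        val buffered = state.getOption.getOrElse(Seq.empty).sortBy(_.ts.getTime)
        state.remove()
        buffered.iterator
      } else {
        // Keep buffering and push the event-time timeout forward, e.g. one minute
        // past the latest buffered event (but never behind the current watermark).
        val buffered = state.getOption.getOrElse(Seq.empty) ++ batch
        state.update(buffered)
        val timeout = math.max(
          buffered.map(_.ts.getTime).max + 60 * 1000L,
          state.getCurrentWatermarkMs() + 1)
        state.setTimeoutTimestamp(timeout)
        Iterator.empty
      }
  }
```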


Q1: Is there any way to get an aggregated view of the window while keeping each element, e.g. materialize all the elements in a window into a list?

That sounds like a streaming join where you have the input stream on your left and the aggregated stream (a streaming aggregation) on your right. That should be doable, though I'm still not sure I've understood your question correctly.
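If the main goal is just to keep every element of a window, the aggregated stream could for example be a windowed groupBy with collect_list. A minimal sketch, reusing the hypothetical events Dataset from above and assuming 5-minute tumbling windows (note that collect_list does not guarantee ordering, so sorting by ts would happen in a later transformation):

```scala
import org.apache.spark.sql.functions.{col, collect_list, struct, window}

// Aggregated view per (uid, window): all elements of the window materialized into a list.
val aggregated = events
  .withWatermark("ts", "10 minutes")                     // hypothetical lateness bound
  .groupBy(col("uid"), window(col("ts"), "5 minutes"))   // assumed 5-minute tumbling windows
  .agg(collect_list(struct(col("uid"), col("ts"))).as("elements"))
```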

Q2: If Q1 is possible, I would like to set a watermark and trigger combination that triggers once at the end of the window, and then triggers either periodically or every time late data arrives. Is that possible?

Use the window standard function to define the windows and a watermark to "close" them at the proper time. That is also doable.
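As a rough illustration (again reusing the hypothetical aggregated stream from above): with append output mode each window is emitted exactly once, after the watermark passes the end of the window, while with update mode the window's row is re-emitted whenever new data changes it, which covers late data that is still within the watermark delay; data arriving later than the watermark is simply dropped.

```scala
import org.apache.spark.sql.streaming.Trigger

val query = aggregated.writeStream
  .outputMode("update")                          // re-emit a window whenever it changes;
                                                 // use "append" to emit each window once,
                                                 // after the watermark passes its end
  .trigger(Trigger.ProcessingTime("1 minute"))   // hypothetical periodic trigger
  .format("console")
  .start()
```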