3
votes

The Spark Streaming programming guide (https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations) says that reduceByKeyAndWindow will "Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func".

The example given is generating word counts over the last 30 seconds of data, every 10 seconds.

The part I am confused about is how exactly reduceByKeyAndWindow works. A windowed stream is composed of multiple RDDs, so wouldn't reduceByKeyAndWindow just return a stream of RDDs instead of one RDD?
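For reference, the guide's word-count example looks roughly like this in Scala (a sketch; the `pairs` DStream of (word, 1) tuples is assumed to already exist):

```scala
import org.apache.spark.streaming.Seconds

// Sketch: word counts over the last 30 seconds of data,
// computed every 10 seconds.
// `pairs` is assumed to be a DStream[(String, Int)] of (word, 1) pairs.
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce function applied per key
  Seconds(30),               // window length
  Seconds(10)                // slide interval
)
```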


1 Answer

6
votes

Spark Streaming is a micro-batch based streaming library. That means the streaming data is divided into batches based on a time slice called the batch interval. Every batch gets converted into an RDD, and this continuous stream of RDDs is represented as a DStream.
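As a minimal sketch of that setup (assuming a local run and a socket text source on localhost:9999, both hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowExample").setMaster("local[2]")
// Batch interval = 10 seconds: each 10-second slice of input
// becomes one RDD in the DStream.
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
```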

Sometimes we need to know what happened in the last n seconds, every m seconds. As a simple example, let's say the batch interval is 10 seconds and we need to know what happened in the last 60 seconds, every 30 seconds. Here 60 seconds is called the window length and 30 seconds the slide interval. Say the first 6 batches are A, B, C, D, E, F; these make up the first window. After 30 seconds the second window is formed, containing D, E, F, G, H, I. As you can see, 3 batches are shared between the first and second windows.
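Continuing the sketch above, the window parameters for this example would look like:

```scala
// Window length = 60s, slide interval = 30s.
// With a 10-second batch interval, each window spans 6 batches
// (A..F), and consecutive windows share 3 of them (D, E, F).
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  Seconds(60), // window length
  Seconds(30)  // slide interval
)
ssc.start()
ssc.awaitTermination()
```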

One thing to remember about windows is that Spark holds the whole window in memory. For the first window it combines RDDs A through F using the union operator to create one big RDD. That takes roughly 6 times the memory of a single batch, which is fine if that's what you need. So in reduceByKeyAndWindow, once the data in the window has been unioned into one RDD, reduceByKey is applied to it, and the resulting RDD is returned in the output DStream every slide interval. That answers your question: the operation does produce a DStream, but each RDD in that DStream is the single reduced RDD for one window.
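A rough hand-rolled equivalent of that behavior, using window() plus a per-window reduceByKey (just to illustrate the union-then-reduce idea, not how reduceByKeyAndWindow is implemented internally):

```scala
// window() unions the 6 batch RDDs in each window into one RDD;
// transform() then runs an ordinary reduceByKey on that unioned RDD.
val windowed = pairs.window(Seconds(60), Seconds(30))
val countsPerWindow = windowed.transform(rdd => rdd.reduceByKey(_ + _))
```

In practice reduceByKeyAndWindow is preferred, especially the variant that also takes an inverse reduce function, which lets Spark update the previous window's result incrementally instead of recomputing the whole window.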