Q) If there is a Kafka topic with 3 partitions, will 3 Spark executors run in parallel, each reading a partition?
In more specific terms, there will be 3 tasks submitted to the Spark cluster, one for each partition. Where these tasks execute depends on your cluster topology and locality settings, but in general you can consider that these 3 tasks will run in parallel.
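If you want to see that mapping yourself, a quick check is to print the partition count of each micro-batch RDD; every partition becomes one task when an action runs. This is only a sketch, assuming dstream is the direct stream from the pseudocode further down:

dstream.foreachRDD { rdd =>
  // With the direct stream there is one RDD partition per Kafka partition,
  // so a 3-partition topic yields 3 partitions and therefore 3 parallel tasks per action
  println(s"Partitions (and tasks per action) in this batch: ${rdd.getNumPartitions}")
}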
Q) Suppose I have a window operation after the data is read. Does the window operation apply the window across partitions or within one partition?
The fundamental model of Spark, and by extension of Spark Streaming, is that operations are declared on an abstraction (RDD/Dataset for Spark, DStream for Spark Streaming), and at execution time those operations are applied in a distributed fashion, using the native partitioning of the data.
(I'm not sure about the distinction the question makes between "across partitions" and "within one partition". The window is preserved per partition, and the operations are applied according to their own semantics: a map operation is applied per partition, while a count operation is first applied to each partition and then consolidated into one result.)
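As a minimal sketch of that behaviour (assuming dstream is a DStream[String] on a 30-second batch interval, as in the pseudocode below):

// Window covering the last 600 seconds, sliding every 30 seconds
val windowed = dstream.window(Seconds(600), Seconds(30))

// map: applied independently within each partition of every RDD in the window
val normalized = windowed.map(_.toLowerCase)

// count: computed per partition first, then consolidated into a single value per window
windowed.count().print()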
Regarding the pseudo code:
val dstream = createDirectStream(..., Seconds(30))
dstream.window(Seconds(600)) // this does nothing, as the new DStream is not referenced any further
val windowDstream = dstream.window(timePeriod) // this creates a new windowed DStream based on the base DStream
dstream.saveAsTextFiles() // this writes at the original batch interval (30 seconds): one output directory per batch, containing one part-file per Kafka partition (3 here)
windowDstream.saveAsTextFiles() // this writes the windowed data (the last 600 seconds): again one output directory per interval, containing one part-file per partition of the windowed RDD
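For reference, here is what that pseudocode could look like as a runnable program. This is only a sketch: the broker address, topic name, group id, and output paths are placeholders, and it assumes the spark-streaming-kafka-0-10 integration. Note that the 30-second batch interval is set on the StreamingContext, not passed to createDirectStream:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object WindowedDirectStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-direct-stream")
    // The 30-second batch interval lives on the StreamingContext
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",                      // placeholder group id
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream over a 3-partition topic: each batch RDD has 3 partitions
    val dstream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Set("my-topic"), kafkaParams) // placeholder topic
    ).map(record => record.value)

    // One output directory every 30 seconds, one part-file per Kafka partition
    dstream.saveAsTextFiles("hdfs:///tmp/raw")

    // Window and slide of 600 seconds: each output covers the last 10 minutes
    val windowDstream = dstream.window(Seconds(600), Seconds(600))
    windowDstream.saveAsTextFiles("hdfs:///tmp/windowed")

    ssc.start()
    ssc.awaitTermination()
  }
}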
Given this code (note naming changes!):
val dstream = createDirectStream(...)
dstream.action1()
val windowDStream = dstream.window(...)
windowDStream.action2()
For action2, would the same set of executors be used? Otherwise, the data would have to be read from Kafka again, which sounds like a bad thing.
In the Direct Stream model, the RDDs at each interval do not contain any data, only offset ranges (offset-start, offset-end). The data is read only when an action is applied.
A windowed DStream over a direct stream is, therefore, just a series of offset ranges: window(1-3) = (offset1-start, offset1-end), (offset2-start, offset2-end), (offset3-start, offset3-end). When an action is applied to that window, these ranges are fetched from Kafka and the operation is applied. This is not "bad" as implied in the question: it saves us from storing intermediate data for long periods of time and preserves the operation semantics on the data.
So, yes, the data will be read again, and that's a good thing.
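To make the "offsets only" point concrete, here is a small sketch that prints the offset ranges behind each batch. It assumes dstream is the raw InputDStream returned by createDirectStream; the cast to HasOffsetRanges only works on that stream, before any map or window is applied:

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

dstream.foreachRDD { rdd =>
  // Each batch RDD of a direct stream only carries offset ranges; records are fetched lazily
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic} partition ${r.partition}: offsets ${r.fromOffset} to ${r.untilOffset}")
  }
}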