3
votes

What is the functionality of the queueStream function in Spark StreamingContext? As I understand it, it is a queue that queues the incoming DStream. If that is the case, how is it handled in a cluster with many nodes? Does each node have this queueStream, and is the DStream partitioned among all the nodes in the cluster? How does this queueStream work in a cluster setup?

I have read the explanation below in the [Spark Streaming documentation](https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources), but I didn't understand it completely. Please help me to understand it.

Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.

import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val myQueueRDD = Queue[RDD[MyObject]]()
val myStream = ssc.queueStream(myQueueRDD)

for (count <- 1 to 100) {
  val randomData = generateData()                    // generate random test data
  val rdd = ssc.sparkContext.parallelize(randomData) // create an RDD from the random data
  myQueueRDD += rdd                                  // enqueue the RDD as one batch
}

myStream.foreachRDD(rdd => rdd.mapPartitions(data => evaluate(data)))

How will the above code get executed in the Spark streaming context with respect to partitions on different nodes?


1 Answer

2
votes

QueueInputDStream is intended for testing. It uses a standard scala.collection.mutable.Queue to store RDDs that imitate incoming batches.

Does each node have this queueStream, and is the DStream partitioned among all the nodes in the cluster?

No. There is only one copy of the queue, and all data distribution is handled by the RDDs themselves. The compute logic is simple: at each tick it either dequeues a single RDD (oneAtATime set to true) or takes the union of all RDDs currently in the queue (oneAtATime set to false). This applies to DStreams in general: each stream is just a sequence of RDDs, and it is the RDDs that provide the data distribution mechanism.
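
Here is a minimal, self-contained sketch (the master URL, batch interval, and sample data are illustrative assumptions) showing that the queue lives on the driver, while partitioning comes from each enqueued RDD:

import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[4]").setAppName("queueStreamDemo")
val ssc = new StreamingContext(conf, Seconds(1))

val queue = Queue[RDD[Int]]()
queue += ssc.sparkContext.parallelize(1 to 100, numSlices = 4) // RDD with 4 partitions

val stream = ssc.queueStream(queue)
stream.foreachRDD { rdd =>
  // The RDD keeps its own partitioning; the queue does not repartition it.
  println(s"partitions = ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTerminationOrTimeout(5000) // run a few batches, then stop
ssc.stop()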

While it still follows the InputDStream API, conceptually it is just a local collection on the driver from which you take elements every batchDuration.
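
For completeness, queueStream also exposes the oneAtATime flag directly; a short sketch of the two modes (reusing the ssc and queue from the sketch above):

// oneAtATime = true (the default): exactly one RDD is dequeued per batch interval.
val oneByOne = ssc.queueStream(queue, oneAtATime = true)

// oneAtATime = false: all RDDs currently in the queue are unioned into a single batch at each tick.
val allAtOnce = ssc.queueStream(queue, oneAtATime = false)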