This may be a basic question, but I'm having some trouble understanding this.
I am currently using Microsoft Azure Event Hubs streaming in my Spark/Scala application, which is similar to Kafka.
If I create a unioned stream, which I believe abstracts multiple DStream objects to look like a single DStream, will the multiple RDDs in the stream be processed in parallel, or will each RDD be processed one at a time?
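To illustrate my mental model of the union (a plain-Scala analogy, not actual Spark code — the names streamA/streamB are hypothetical): several sources get presented as one sequence, which is how I picture createUnionStream combining the per-partition streams.

```scala
// Hypothetical analogy: two per-partition "streams" merged into one,
// the way I imagine a unioned DStream looks to downstream operators.
val streamA = Seq("a1", "a2")
val streamB = Seq("b1", "b2")
val unioned = streamA ++ streamB // stand-in for EventHubsUtils.createUnionStream

// The consumer just sees one combined stream:
assert(unioned == Seq("a1", "a2", "b1", "b2"))
```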
To try and explain this more, here is a quick example:
sparkConf.set(SparkArgumentKeys.MaxCores, (partitionCount * 2).toString)

val ssc = new StreamingContext(sparkConf, streamDuration)
val stream = EventHubsUtils.createUnionStream(ssc, hubParams, storageLevel)
stream.checkpoint(streamDuration)

val strings = stream.map(f => new String(f))

strings.foreachRDD(rdd => {
  // note: map is a lazy transformation, so without an action
  // (e.g. count, collect, saveAsTextFile) this line does no work
  rdd.map(f => f.split(' '))
})
partitionCount is the number of partitions in the Azure Event Hub.
- Does the initial "stream.map" run on each RDD in parallel?
- Does "strings.foreachRDD" process a single RDD at a time, or does it process all the RDDs in some parallel manner?