I'm trying to pass a partial function to the union of all the RDDs captured in a DStream batch over a sliding window. Let's say I construct a window operation over 10 seconds on a stream discretized into 1-second batches:
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val stream = ssc.socketStream(...)
val window = stream.window(Seconds(10))
My window will have K many RDDs. I want to use collect(f: PartialFunction[T, U]) on the union of all K of these RDDs. I could call the union operator ++ using foreachRDD, but I want to return an RDD, not a Unit, and avoid side effects.
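For concreteness, the side-effecting foreachRDD version I want to avoid would look something like this (a rough sketch, assuming the stream is a DStream[String]; the mutable unioned variable is purely illustrative):

import org.apache.spark.rdd.RDD

// The union only escapes through a mutable var, because foreachRDD
// itself returns Unit rather than an RDD.
var unioned: RDD[String] = ssc.sparkContext.emptyRDD[String]
window.foreachRDD { rdd =>
  unioned = unioned ++ rdd  // side effect; nothing is returned
}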
What I'm looking for is a reducer like
def reduce(f: (RDD[T], RDD[T]) ⇒ RDD[T]): RDD[T]
on a DStream that I can use like so:
window.reduce(_ ++ _).transform(_.collect(myPartialFunc))
But this is not available in the Spark Streaming API.
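If window's per-batch RDDs already contain the union of the window's elements (which is how I read DStream.window), then transform alone might get close (a sketch; matched is just an illustrative name):

// transform applies an RDD-to-RDD function to each batch and returns a DStream
val matched = window.transform(rdd => rdd.collect(myPartialFunc))

But that relies on the windowing behavior rather than giving me an explicit reducer.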
Does anyone have any good ideas for combining RDDs captured in a stream into a single RDD so I can pass in a partial function? Or for implementing my own RDD reducer? Perhaps this feature is coming in a subsequent Spark release?
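For example, rolling my own reducer on top of DStream.slice might look something like this (a hypothetical, untested helper; reduceWindow is my own name, and the from/to times would have to line up with batch boundaries):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical reducer: pull the window's RDDs out with slice and fold
// them together, e.g. reduceWindow(stream, from, to)(_ ++ _)
def reduceWindow[T](stream: DStream[T], from: Time, to: Time)
                   (f: (RDD[T], RDD[T]) => RDD[T]): RDD[T] =
  stream.slice(from, to).reduce(f)

But driving this repeatedly at the right times seems to reintroduce exactly the statefulness I'm trying to avoid.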
compute only accepts a validTime parameter. Is this the start or end of my window? Also, how will I deal with having to repeatedly call compute on the same interval as my batches? I'm looking for something less stateful. – nmurthy

collect on a DStream? Could you further explain what you're trying to do? There's probably another way. – maasg

collect on the union of all the RDDs captured in one DStream interval. There are two steps to what I'm trying to do: (1) reduce all the RDDs in one DStream interval with the ++ operator into a single RDD, and then (2) call collect on my reduced RDD using a DStream transform. – nmurthy

collect afterwards? collect is not much more than combining filter and map, which are available on the DStream API - not sure why it's required to union the RDDs, though. – maasg