Is there any way in Spark Streaming to keep data across multiple micro-batches of a sorted DStream, where the stream is sorted by timestamp? (Assume the data arrives monotonically.) Can anyone suggest how to keep data across iterations, where each iteration is an RDD being processed in a JavaDStream?
What does iteration mean?
I first sort the DStream by timestamp, assuming the data arrives with monotonically increasing timestamps (no out-of-order records).
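For concreteness, here is roughly how I do the per-batch sort (a minimal sketch; the socket source, the "timestamp,value" record format, and the class name SortedStream are just placeholders for my actual setup):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.*;
    import scala.Tuple2;

    public class SortedStream {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("SortedStream");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            // Placeholder source: lines of "timestamp,value" from a socket.
            JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

            JavaPairDStream<Long, String> byTime = lines.mapToPair(line -> {
                String[] parts = line.split(",", 2);
                return new Tuple2<>(Long.parseLong(parts[0]), parts[1]);
            });

            // Sort each micro-batch RDD by timestamp. Note this orders records
            // only within a batch, not across batches.
            JavaPairDStream<Long, String> sorted =
                byTime.transformToPair(rdd -> rdd.sortByKey());

            sorted.print();
            jssc.start();
            jssc.awaitTermination();
        }
    }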
I need a global HashMap X that is updated using the values with timestamp "t1" first, and only then with the values at "t1+1". Because the state of X itself feeds into the calculation, the updates have to be applied linearly: the operation at "t1+1" depends on HashMap X, which in turn depends on the data at and before "t1".
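What I would like is roughly the following (a sketch only, assuming the state fits in driver memory; X and the updateX helper are hypothetical names, and `sorted` is the timestamp-sorted pair stream from the sketch above):

    import java.util.HashMap;
    import java.util.Map;
    import scala.Tuple2;

    final Map<String, Double> X = new HashMap<>();  // global state, lives on the driver

    sorted.foreachRDD(rdd -> {
        // collect() returns the batch in sortByKey() order, so updates are
        // applied linearly: the update at t1+1 sees the X produced by t1.
        for (Tuple2<Long, String> record : rdd.collect()) {
            updateX(X, record);  // hypothetical linear update of X
        }
    });

This works within one batch, but I see no clean way to carry X forward across micro-batches.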
Application
This comes up whenever one is updating a model, comparing two sets of RDDs, or keeping a global history of certain events, where that history impacts the operations in future iterations.
I would like to keep some accumulated history for my calculations: not the entire dataset, but to persist certain events so they can be used in future DStream RDDs. Is there a way to do this?
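To illustrate the kind of "persist certain events" state I am after, here is a sketch using updateStateByKey, which keeps per-key state across micro-batches (this assumes Spark 2.x's org.apache.spark.api.java.Optional, a checkpoint directory set via jssc.checkpoint(...), and a hypothetical `events` JavaPairDStream<String, Long> of per-event counts derived from the stream):

    import java.util.List;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // Running count of selected events per key, accumulated across all batches.
    JavaPairDStream<String, Long> history = events.updateStateByKey(
        (List<Long> newValues, Optional<Long> state) -> {
            long sum = state.orElse(0L);           // state carried from previous batches
            for (Long v : newValues) sum += v;     // fold in this batch's values
            return Optional.of(sum);               // becomes next batch's state
        });

This gives per-key state rather than a single global, timestamp-ordered structure, though, and I do not see how to make the update order follow the global timestamp sort, which is why I am asking.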