In CouchDB and system designs like Incoop, there's a concept called "Incremental MapReduce" where results from previous executions of a MapReduce algorithm are saved and used to skip over sections of input data that haven't been changed.
Say I have 1 million rows divided into 20 partitions. If I run a simple MapReduce over this data, I could cache/store the result of reducing each separate partition, before they're combined and reduced again to produce the final result. If I only change data in the 19th partition then I only need to run the map & reduce steps on the changed section of the data, and then combine the new result with the saved reduce results from the unchanged partitions to get an updated result. Using this sort of catching I'd be able to skip almost 95% of the work for re-running a MapReduce job on this hypothetical dataset.
Is there any good way to apply this pattern to Spark? I know I could write my own tool for splitting up input data into partitions, checking if I've already processed those partitions before, loading them from a cache if I have, and then running the final reduce to join all the partitions together. However, I suspect that there's an easier way to approach this.
I've experimented with checkpointing in Spark Streaming, and that is able to store results between restarts, which is almost what I'm looking for, but I want to do this outside of a streaming job.
RDD caching/persisting/checkpointing almost looks like something I could build off of - it makes it easy to keep intermediate computations around and reference them later, but I think cached RDDs are always removed once the SparkContext is stopped, even if they're persisted to disk. So caching wouldn't work for storing results between restarts. Also, I'm not sure if/how checkpointed RDDs are supposed to be loaded when a new SparkContext is started... They seem to be stored under a UUID in the checkpoint directory that's specific to a single instance of the SparkContext.