Can someone please explain to me how Spark Streaming executes the window() operation? From the Spark 1.6.1 documentation, windowed batches appear to be automatically cached in memory, but the web UI suggests that operations already executed in previous batches are being executed again. For convenience, I've attached a screenshot of my running application below:
Looking at the web UI, the flatMapValues() RDDs appear to be cached (the green dot; flatMapValues() is the last transformation executed before I call window() on the DStream), but at the same time it also seems that all the transformations leading up to flatMapValues() in previous batches are executed again. If this is the case, the window() operation can incur a huge performance penalty, especially when the window duration is 1 or 2 hours (as I expect for my application). Do you think that checkpointing the DStream would help in that case? The expected slide interval is about 5 minutes.
I hope someone can clarify this point.
EDIT
I've added a code snippet. stream1 and stream2 are data feeds read from HDFS.
JavaPairDStream<String, String> stream1 = cdr_orig.mapToPair(parserFunc)
        .flatMapValues(new Function<String, Iterable<String>>() {
            @Override
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(","));
            }
        })
        .window(Durations.seconds(WINDOW_DURATION), Durations.seconds(SLIDE_DURATION));
JavaPairDStream<String, String> join = stream2.join(stream1);
The two streams are produced periodically by another system. They are asynchronous, meaning that a record appearing in stream2 at time t appears in stream1 at some time t' <= t. I'm using window() to keep stream1 records around for 1-2 hours, but this becomes very inefficient if the transformations on past batches of stream1 are re-executed for every new batch.
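For reference, this is roughly what I have in mind when I ask about checkpointing: a minimal sketch (not my production code) that explicitly persists the windowed DStream and checkpoints it to truncate the lineage. It assumes a JavaStreamingContext named jssc and a hypothetical HDFS checkpoint directory; parserFunc, WINDOW_DURATION, and SLIDE_DURATION are the same as in the snippet above.

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Enable metadata/data checkpointing on the context (hypothetical path).
jssc.checkpoint("hdfs:///checkpoints/myapp");

JavaPairDStream<String, String> stream1 = cdr_orig.mapToPair(parserFunc)
        .flatMapValues(new Function<String, Iterable<String>>() {
            @Override
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(","));
            }
        })
        .window(Durations.seconds(WINDOW_DURATION), Durations.seconds(SLIDE_DURATION));

// Explicitly persist the windowed RDDs (spilling to disk if they don't fit
// in memory), instead of relying on the default windowing cache behavior.
stream1.persist(StorageLevel.MEMORY_AND_DISK_SER());

// Periodically checkpoint the DStream so the RDD lineage does not keep
// growing across batches; the interval here (a multiple of the slide
// interval) is just a guess on my part.
stream1.checkpoint(Durations.seconds(SLIDE_DURATION * 10));
```

Would something like this avoid re-executing the transformations of past batches, or does window() already handle that internally?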
