4 votes

When consuming from a Kafka source with Spark DStreams, one can checkpoint the streaming context so that when the app crashes (or is hit with a kill -9), it can recover from the checkpoint. But if the app is 'accidentally deployed with bad logic', one might want to rewind to the last good topic+partition+offset and replay events from the positions that were processed correctly before the 'bad logic' went in. How are streaming apps rewound to the last 'good spot' (topic+partition+offset) when checkpointing is in effect?
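For reference, a minimal sketch of the checkpoint recovery being described, assuming a hypothetical checkpoint directory, app name, and batch interval (none of these come from the question itself):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app"  // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("my-streaming-app")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)
      // ... set up the Kafka direct stream and transformations here ...
      ssc
    }

    // On a clean start this builds a fresh context; after a crash (or kill -9)
    // it restores the context, including stored Kafka offsets, from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()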

Note: In I (Heart) Logs, Jay Kreps writes about using a parallel consumer (group) process that starts at the diverging Kafka offset locations, runs until it has caught up with the original, and then the original is killed. (What would this second Spark Streaming process look like with respect to starting from specific partition/offset locations?)

Sidebar: This question may be related to Mid-Stream Changing Configuration with Check-Pointed Spark Stream, as a similar mechanism may need to be deployed.


1 Answer

3 votes

You are not going to be able to rewind a stream in a running StreamingContext. It's important to keep these points in mind (straight from the docs):

  • Once a context has been started, no new streaming computations can be set up or added to it.
  • Once a context has been stopped, it cannot be restarted.
  • Only one StreamingContext can be active in a JVM at the same time.
  • stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
  • A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.

Instead, you are going to have to stop the current stream and create a new one. You can start a stream from a specific set of offsets using one of the versions of createDirectStream that takes a fromOffsets parameter with the signature Map[TopicAndPartition, Long]: the starting offset for each topic and partition.
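A minimal sketch of that restart, using the Kafka 0.8 direct stream API; the broker address, topic name, batch interval, and 'last good' offsets are all hypothetical stand-ins:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Stop the old stream, but keep the underlying SparkContext for reuse.
    ssc.stop(stopSparkContext = false, stopGracefully = true)

    val newSsc = new StreamingContext(sc, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // The last known-good position for each topic+partition (hypothetical values).
    val fromOffsets = Map(
      TopicAndPartition("events", 0) -> 105000L,
      TopicAndPartition("events", 1) -> 98500L
    )

    // The message handler decides what each record becomes; here, a (key, value) pair.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      newSsc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    )

From here you would attach the corrected transformations and start newSsc as usual.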

Another theoretical possibility is to use KafkaUtils.createRDD, which takes an array of OffsetRanges as input. Say your "bad logic" started at offset X and you fixed it at offset Y. For certain use cases, you might just want to call createRDD over the offsets from X to Y and process those results, instead of trying to do it as a stream.
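Again only as a sketch, reusing the hypothetical topic and broker from above and pretending X = 105000 and Y = 112000 on a single partition:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // One range per partition: replay exactly offsets X (inclusive) to Y (exclusive).
    val offsetRanges = Array(
      OffsetRange("events", 0, fromOffset = 105000L, untilOffset = 112000L)
    )

    // A plain batch RDD over just those records; no streaming context involved.
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges
    )

    rdd.foreach { case (key, value) =>
      // reprocess each replayed record with the fixed logic
    }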