
I am confused about Spark Streaming checkpointing; please help me, thanks!

  1. There are two types of checkpointing (metadata and data checkpointing), and the guide says data checkpointing is used when you apply stateful transformations. This confuses me: if I don't use any stateful transformations, does Spark still write data checkpoint content?

  2. Can I control the checkpoint position in code? Can I control which RDD gets written to the data checkpoint in streaming, the way I can in a batch Spark job? Can I use `foreachRDD { rdd => rdd.checkpoint() }` in streaming? (See the sketch after this list.)

  3. If I don't call `rdd.checkpoint()`, what is Spark's default behavior? Which RDDs get written to HDFS?
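To make question 2 concrete, here is the kind of thing I mean, roughly as I would type it into the spark-shell (the paths are made up, and the commented part is what I am asking about):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Batch job: I can pick exactly which RDD gets checkpointed.
val sc = new SparkContext(new SparkConf().setAppName("batch").setMaster("local[2]"))
sc.setCheckpointDir("hdfs:///tmp/batch-checkpoint")   // hypothetical path
val data = sc.textFile("hdfs:///tmp/input")           // hypothetical path
val cleaned = data.filter(_.nonEmpty)
cleaned.checkpoint()   // explicitly mark this RDD for checkpointing
cleaned.count()        // an action triggers the actual write

// Streaming: is this the equivalent way to checkpoint each batch's RDD?
// dstream.foreachRDD { rdd =>
//   rdd.checkpoint()
//   rdd.count()
// }
```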


1 Answer


You can find an excellent guide at this link.

  1. No. A stateless computation keeps no intermediate state that would need to be recovered, so there is nothing for data checkpointing to write; only metadata checkpointing happens. (See the sketch below.)
  2. I don't think you need to checkpoint any RDD yourself after a computation in streaming. RDD checkpointing exists to cut long lineage chains, whereas the streaming checkpoint is about reliability and failure recovery of the streaming application itself.
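A minimal sketch of the difference, assuming a socket source, a local master and a hypothetical HDFS checkpoint directory (adjust these for your setup): the stateless `reduceByKey` branch only relies on metadata checkpointing, while `updateStateByKey` carries state across batches and therefore also makes Spark periodically checkpoint the generated state RDDs (data checkpointing).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointDemo {
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Metadata checkpointing: configuration, the DStream graph and pending
    // batches go to this directory, whether or not the job is stateful.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // hypothetical path

    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Stateless: counts within each batch only; no data checkpointing needed.
    pairs.reduceByKey(_ + _).print()

    // Stateful: state is kept across batches, so Spark also checkpoints the
    // generated state RDDs to the same directory from time to time.
    val runningCounts = pairs.updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
    }
    runningCounts.print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a driver restart, getOrCreate rebuilds the context from the checkpoint
    // instead of calling createContext again.
    val ssc = StreamingContext.getOrCreate("hdfs:///tmp/streaming-checkpoint", createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

If you removed the `updateStateByKey` branch, the job would be purely stateless and only the metadata checkpoint would be written; you would still want the checkpoint directory so the driver can recover after a failure.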