
I am confused about Spark Streaming checkpointing; please help me, thanks!

  1. There are two types of checkpointing (metadata and data checkpointing), and the guide says data checkpointing is used when you apply stateful transformations. This confuses me: if I don't use any stateful transformations, does Spark still write data checkpoint content?

  2. Can I control the checkpoint position in code? Can I control which RDD gets written to the data checkpoint in streaming, the way I can in a batch Spark job? Can I use `foreachRDD { rdd => rdd.checkpoint() }` in streaming? (See the sketch after this list.)

  3. If I don't call `rdd.checkpoint()`, what is Spark's default behavior? Which RDDs get written to HDFS?
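To make question 2 concrete, here is the kind of thing I mean, roughly as I would type it into the spark-shell (the paths are made up, and the commented part is what I am asking about):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Batch job: I can pick exactly which RDD gets checkpointed.
val sc = new SparkContext(new SparkConf().setAppName("batch").setMaster("local[2]"))
sc.setCheckpointDir("hdfs:///tmp/batch-checkpoint")   // hypothetical path
val data = sc.textFile("hdfs:///tmp/input")           // hypothetical path
val cleaned = data.filter(_.nonEmpty)
cleaned.checkpoint()   // explicitly mark this RDD for checkpointing
cleaned.count()        // an action triggers the actual write

// Streaming: is this the equivalent way to checkpoint each batch's RDD?
// dstream.foreachRDD { rdd =>
//   rdd.checkpoint()
//   rdd.count()
// }
```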


1 Answer


You can find an excellent guide at this link.

  1. No. A stateless computation keeps no intermediate state that would need to be recovered, so there is nothing for data checkpointing to write; only metadata checkpointing happens. (See the sketch below.)
  2. I don't think you need to checkpoint any RDD yourself after a computation in streaming. RDD checkpointing exists to cut long lineage chains, whereas the streaming checkpoint is about reliability and failure recovery of the streaming application itself.
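A minimal sketch of the difference, assuming a socket source, a local master and a hypothetical HDFS checkpoint directory (adjust these for your setup): the stateless `reduceByKey` branch only relies on metadata checkpointing, while `updateStateByKey` carries state across batches and therefore also makes Spark periodically checkpoint the generated state RDDs (data checkpointing).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointDemo {
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Metadata checkpointing: configuration, the DStream graph and pending
    // batches go to this directory, whether or not the job is stateful.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // hypothetical path

    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Stateless: counts within each batch only; no data checkpointing needed.
    pairs.reduceByKey(_ + _).print()

    // Stateful: state is kept across batches, so Spark also checkpoints the
    // generated state RDDs to the same directory from time to time.
    val runningCounts = pairs.updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
    }
    runningCounts.print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a driver restart, getOrCreate rebuilds the context from the checkpoint
    // instead of calling createContext again.
    val ssc = StreamingContext.getOrCreate("hdfs:///tmp/streaming-checkpoint", createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

If you removed the `updateStateByKey` branch, the job would be purely stateless and only the metadata checkpoint would be written; you would still want the checkpoint directory so the driver can recover after a failure.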