In Spark Streaming it is possible (and mandatory if you are going to use stateful operations) to configure the StreamingContext to perform checkpoints into reliable data storage (S3, HDFS, ...) of both:
- Metadata
- DStream lineage
As described here, to set the checkpoint data store you need to call yourSparkStreamingCtx.checkpoint(datastoreURL)
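A minimal sketch of that setup in Scala (the application name, master, batch interval, and HDFS path below are illustrative placeholders, not from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative configuration; app name, master, and path are placeholders.
val conf = new SparkConf().setAppName("CheckpointDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2)) // 2-second batch interval

// Enables metadata checkpointing into the given reliable storage and is a
// prerequisite for stateful operations such as updateStateByKey.
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")
```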
On the other hand, it is possible to set a lineage checkpoint interval for each DStream by calling checkpoint(timeInterval) on it. In fact, the documentation recommends setting the lineage checkpoint interval to between 5 and 10 times the DStream's sliding interval:
dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
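Applying that recommendation, a sketch (the socket source, host, port, and intervals are assumptions for illustration) could checkpoint a DStream's lineage at 5x its batch/sliding interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("LineageCheckpointDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2)) // 2-second batch interval
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")

// Hypothetical source; any DStream exposes the same checkpoint(interval) call.
val lines = ssc.socketTextStream("localhost", 9999)

// Lineage checkpoint every 10 s, i.e. 5x the 2 s sliding interval,
// inside the recommended 5-10x range.
lines.checkpoint(Seconds(10))
```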
My question is:
When the streaming context has been set to perform checkpointing and no ds.checkpoint(interval) is called, is lineage checkpointing enabled for all data streams with a default checkpointInterval equal to batchInterval? Or, on the contrary, is only metadata checkpointing enabled?
Comments:
- So calling strmCtx.checkpoint("hdfs://...") also enables checkpointing for all data streams, with an update interval equal to the context batch interval? - Pablo Francisco Pérez Hidalgo
- In other words: do DStreams enable checkpointing when the StreamingContext has been set to perform checkpoints? - Pablo Francisco Pérez Hidalgo