In Spark Streaming it is possible (and mandatory if you're going to use stateful operations) to configure the StreamingContext
to checkpoint both of the following to reliable data storage (S3, HDFS, ...):
- Metadata
- DStream lineage
As described here, to set the output data store you need to call yourSparkStreamingCtx.checkpoint(datastoreURL).
On the other hand, it is possible to set a lineage checkpoint interval for each DStream by calling checkpoint(timeInterval) on it. In fact, it is recommended to set the lineage checkpoint interval to between 5 and 10 times the DStream's sliding interval:
dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
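To make the two calls concrete, here is a minimal sketch of the setup I am describing. It assumes a local master, a socket source on localhost:9999, a local checkpoint directory, and a running word count as the stateful operation; all of those are placeholders chosen for illustration, not part of my real job.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointIntervalSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; a real job would run on a cluster.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("checkpoint-interval-sketch")

    val batchInterval = Seconds(2)
    val ssc = new StreamingContext(conf, batchInterval)

    // Context-level call: sets the directory where checkpoints are written.
    // Placeholder path; reliable storage (HDFS, S3, ...) would be used in practice.
    ssc.checkpoint("/tmp/spark-checkpoint-sketch")

    // Hypothetical source and stateful operation (running word count).
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
        Some(values.sum + state.getOrElse(0))
      }

    // DStream-level call: explicit checkpoint interval for this stream.
    // No window is used here, so the sliding interval equals the batch
    // interval, and 5x falls inside the recommended 5 - 10 range.
    counts.checkpoint(batchInterval * 5)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The question below is about what happens when the counts.checkpoint(...) line is left out.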
My question is:
When the streaming context has been set to perform checkpointing and no ds.checkpoint(interval) is called, is lineage checkpointing enabled for all data streams with a default checkpointInterval equal to batchInterval? Or, on the contrary, is only metadata checkpointing enabled?
strmCtx.checkpoint("hdfs://...") it also enables checkpoints for all data streams, with an update interval equal to the context batch interval. – Pablo Francisco Pérez Hidalgo

DStreams enable checkpointing when the StreamingContext has been set to perform checkpoints. – Pablo Francisco Pérez Hidalgo