
I have some questions about fault tolerance in Spark Structured Streaming when reading from Kafka. This is from the Structured Streaming Programming Guide:

In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs.

1) How do you restart a failed query? Can it be done automatically?

You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.
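For reference, my understanding of that last sentence is that it would look roughly like this (a sketch only: `df` stands for the streaming DataFrame built from the Kafka source, and the sink format and paths are placeholders):

    // Sketch of how I read the quoted paragraph -- `df` is the streaming
    // DataFrame from the Kafka source; format and paths are placeholders.
    val query = df.writeStream
      .format("parquet")
      .option("path", "hdfs://namenode:8020/output/events")
      .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/events-query")
      .start()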

2) What happens if you do not specify a checkpoint location? Is a default location chosen, or do you have no fault-tolerance guarantees? Can you specify a path on the local, non-HDFS file system of a single node as the checkpoint location?

stackoverflow.com/questions/55040102/… If you do not enable this (i.e., specify a checkpoint directory), the query cannot recover from old data when it restarts. – shanmuga

1 Answer


You can find the answer to your questions in the StreamingContext docs: https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/streaming/StreamingContext.html

No checkpoint location

If we do not specify a checkpoint location, we won't be able to recover from a failure.

Default checkpoint location

There is no default checkpoint location; we need to specify one.

Non-HDFS checkpoint location

The checkpoint data is reliably stored in an HDFS-compatible directory. Note that this must be a fault-tolerant file system like HDFS, so there is no point in specifying a local checkpoint location.
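Putting this together, here is a minimal sketch of a Kafka query with a fault-tolerant checkpoint location (the topic name, bootstrap servers, and HDFS path are placeholders). Restarting the same application with the same checkpoint path is what lets it resume from the offsets and state saved there.

    import org.apache.spark.sql.SparkSession

    object KafkaCheckpointExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("KafkaCheckpointExample")
          .getOrCreate()
        import spark.implicits._

        // Read from Kafka; the range of offsets processed in each trigger
        // is tracked in the checkpoint.
        val lines = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
          .option("subscribe", "my-topic")                   // placeholder
          .load()
          .selectExpr("CAST(value AS STRING)")
          .as[String]

        // Running aggregate (word counts, as in the guide's quick example).
        val counts = lines.flatMap(_.split(" ")).groupBy("value").count()

        // The checkpoint location must be on a fault-tolerant, HDFS-compatible
        // file system. Re-running this same query with the same checkpoint
        // path resumes from the saved offsets and running aggregates.
        val query = counts.writeStream
          .outputMode("complete")
          .format("console")
          .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/word-counts") // placeholder
          .start()

        query.awaitTermination()
      }
    }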