2 votes

Do we need to checkpoint both readStream and writeStream of Kafka in Spark Structured Streaming ? When do we need to checkpoint both of these streams or only one of these streams?


1 Answer

5 votes

Checkpointing saves information about the data a stream has processed so that, in case of failure, Spark can recover from the last saved progress point. "Processed" means the data has been read from the source, (optionally) transformed, and finally written to a sink.

Therefore, there is no need to set up checkpointing for the reader and the writer separately: after recovery it would make no sense to skip data that was only read but never written to the sink. Moreover, the checkpoint location can only be set as an option on the DataStreamWriter (returned by dataset.writeStream()), and it must be set before starting the stream.
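(As a side note, a session-wide default checkpoint root can also be configured through the spark.sql.streaming.checkpointLocation property; each query then gets its own subdirectory under that root, derived from its query name when one is set. A minimal sketch, assuming the same SparkSession named session:

session.conf().set(
    "spark.sql.streaming.checkpointLocation",
    "s3://test-bucket/checkpoints");  // default root; an explicit per-query checkpointLocation still takes precedence
)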

Here is an example of a simple structured stream with checkpointing:

session
    .readStream()
    // read CSV files as they appear in the input directory
    .schema(RecordSchema.fromClass(TestRecord.class))
    .csv("s3://test-bucket/input")
    .as(Encoders.bean(TestRecord.class))
    .writeStream()
    .outputMode(OutputMode.Append())
    .format("csv")
    .option("path", "s3://test-bucket/output")
    // a single checkpoint location for the whole query, set on the writer
    .option("checkpointLocation", "s3://test-bucket/checkpoint")
    .queryName("test-query")
    .start();
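The same principle applies to a Kafka-to-Kafka pipeline: one checkpointLocation on the writer covers the whole query, including the offsets that were read from the source topic. A minimal sketch, assuming the spark-sql-kafka integration package is on the classpath and that the broker address and topic names below are placeholders:

session
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
    .option("subscribe", "input-topic")                // placeholder topic
    .load()                                            // yields binary key/value columns
    .writeStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic")
    // one checkpoint for the whole query; the source offsets are tracked here too
    .option("checkpointLocation", "s3://test-bucket/kafka-checkpoint")
    .start();

On restart with the same checkpoint location, the query resumes from the last committed offsets recorded in the checkpoint rather than from Kafka's auto-offset-reset position.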