
As far as I know, Spark Structured Streaming achieves fault tolerance by using checkpoints.

I want to read from kafka.

So let's say I use a checkpoint, and then for some reason my code crashes or I stop it. I expect that when I rerun the code, it will recover the already-processed data.

My problem is with the read configuration: if I set the offset to earliest, then after rerunning the code I will read the same data again; and if I set it to latest, I will miss the data that arrived between the crash and the rerun.

Is there a way to read only unread messages from Kafka with Spark 2.3 Structured Streaming (PySpark), and to recover processed data from checkpoints?


1 Answer


It depends on where your code crashes. You don't need to set the offset to earliest; you can leave it at latest. The startingOffsets option only applies the very first time a query starts. On every restart, Spark ignores it and resumes from the offsets stored in the checkpoint, so you can always recover from checkpointing and reprocess the data. These are the semantics of checkpointing.
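A minimal sketch of what this looks like in PySpark, assuming Spark 2.3 with the spark-sql-kafka-0-10 package on the classpath; the broker address, topic name, and paths below are hypothetical placeholders:

```python
def kafka_read_options(bootstrap_servers, topic):
    """Options for the Kafka source.

    startingOffsets only matters on the FIRST run of a query; once a
    checkpoint exists, restarts resume from the checkpointed offsets,
    so no messages between a crash and a rerun are skipped.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",  # ignored after the first run
    }


def start_query(checkpoint_dir="/tmp/checkpoints/demo"):
    # Deferred import so the sketch can be read without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-checkpoint-demo").getOrCreate()

    df = (spark.readStream
          .format("kafka")
          .options(**kafka_read_options("localhost:9092", "events"))
          .load())

    # checkpointLocation is the key part: offsets and sink state are
    # persisted there, enabling recovery after a crash or manual stop.
    return (df.writeStream
            .format("parquet")
            .option("path", "/tmp/output/demo")
            .option("checkpointLocation", checkpoint_dir)
            .start())
```

As long as you keep the same checkpoint directory between runs, rerunning the job picks up exactly where it left off.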