
As far as I know, Spark Structured Streaming achieves fault tolerance by using checkpoints.

I want to read from kafka.

So let's say I use a checkpoint, and then for some reason my code crashes or I stop it. I expect that when I rerun the code, it will recover the already-processed data.

My problem is with the read configuration: if I set the offset to earliest, then after rerunning the code I will read the same data again; and if I set it to latest, I will miss the data that arrived between the crash and the rerun.

Is there a way to read only unread messages from Kafka with Spark 2.3 Structured Streaming (PySpark), and to recover processed data from checkpoints?


1 Answer


It depends on where your code crashes. You don't need to set the offset to earliest; you can leave it at latest. The startingOffsets option only applies the very first time a query starts. On every restart, Spark ignores it and resumes from the offsets stored in the checkpoint, so you can always recover from checkpointing and reprocess the data. These are the semantics of checkpointing.
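A minimal sketch of what this looks like in PySpark, assuming Spark 2.3 with the spark-sql-kafka-0-10 package on the classpath; the broker address, topic name, and paths below are hypothetical placeholders:

```python
def kafka_read_options(bootstrap_servers, topic):
    """Options for the Kafka source.

    startingOffsets only matters on the FIRST run of a query; once a
    checkpoint exists, restarts resume from the checkpointed offsets,
    so no messages between a crash and a rerun are skipped.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",  # ignored after the first run
    }


def start_query(checkpoint_dir="/tmp/checkpoints/demo"):
    # Deferred import so the sketch can be read without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-checkpoint-demo").getOrCreate()

    df = (spark.readStream
          .format("kafka")
          .options(**kafka_read_options("localhost:9092", "events"))
          .load())

    # checkpointLocation is the key part: offsets and sink state are
    # persisted there, enabling recovery after a crash or manual stop.
    return (df.writeStream
            .format("parquet")
            .option("path", "/tmp/output/demo")
            .option("checkpointLocation", checkpoint_dir)
            .start())
```

As long as you keep the same checkpoint directory between runs, rerunning the job picks up exactly where it left off.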