We are trying to build a fault tolerant spark streaming job, there's one problem we are running into. Here's our scenario:
1) Start a spark streaming process that runs batches of 2 mins
2) We have checkpoint enabled. Also the streaming context is configured to either create a new context or build from checkpoint if one exists
3) After a particular batch completes, the spark streaming job is manually killed using yarn application -kill (basically mimicking a sudden failure)
4) The spark streaming job is then restarted from checkpoint
The issue that we are having is that after the spark streaming job is restarted it replays the last successful batch. It always does this, just the last successful batch is replayed, not the earlier batches
The side effect of this is that the data part of that batch is duplicated. We even tried waiting for more than a minute after the last successful batch before killing the process (just in case writing to checkpoint takes time), that didn't help
Any insights? I have not added the code here, hoping someone has faced this as well and can give some ideas or insights. Can post the relevant code as well if that helps. Shouldn't spark streaming checkpoint right after a successful batch so that it is not replayed after a restart? Does it matter where I place the ssc.checkpoint command?