I have a Spark Streaming job running on a cluster (Spark 1.6) that checkpoints to S3. When I first start the job, I can see the "Streaming" tab in the Spark UI. However, when I restart the job from the checkpoint, the Streaming tab disappears. The job still works as a streaming job, and I see the batches appear at the configured batch interval. See below.

Snapshot

If I clear out the checkpoint data, the tab comes back. I suspect that the Streaming tab is not registered correctly while restarting from a checkpoint.

I looked at the Spark Streaming code. Is it possible this flow is not invoked when the application state is deserialised from a checkpoint?

Does anyone know how to fix this?

1 Answer

"If I clear out the checkpoint data, the tab comes back. I suspect that the Streaming tab is not registered correctly while restarting from a checkpoint."

It is invoked, but the Streaming tab doesn't appear until the job has finished loading all the data from the S3 checkpoint location. If your lineage is long, that may take some time. Once all the data is restored from the checkpoint, you'll see the Streaming tab appear.
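
If you'd rather measure the delay than keep refreshing the UI, something like the rough sketch below could poll the driver UI until the tab shows up. It is only an illustration under a few assumptions: that the driver UI is reachable directly (for example on the default port 4040, not behind a YARN proxy), that the tab is served under the `/streaming/` path, and that the UI answers with a non-200 response for that path until the tab is attached. The function names `streaming_tab_is_up` and `wait_for_streaming_tab` are made up for this example and are not part of Spark.

```python
import time
import urllib.error
import urllib.request


def streaming_tab_is_up(ui_base_url, timeout=5.0):
    """Return True if the driver UI answers HTTP 200 for the Streaming tab page.

    Assumption: the tab lives at <ui_base_url>/streaming/ and the UI returns
    an error status for that path while the tab is not attached yet.
    """
    url = ui_base_url.rstrip("/") + "/streaming/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 404 as well as connection failures.
        return False


def wait_for_streaming_tab(ui_base_url, max_wait_s=600.0, poll_s=10.0,
                           probe=streaming_tab_is_up, sleep=time.sleep,
                           clock=time.monotonic):
    """Poll until the tab appears.

    Returns the number of seconds waited, or None if max_wait_s elapsed first.
    probe, sleep and clock are injectable so the loop can be tested offline.
    """
    start = clock()
    while True:
        if probe(ui_base_url):
            return clock() - start
        if clock() - start >= max_wait_s:
            return None
        sleep(poll_s)


if __name__ == "__main__":
    # Hypothetical driver address; replace with your own driver host and port.
    waited = wait_for_streaming_tab("http://localhost:4040")
    if waited is None:
        print("Streaming tab did not appear within the timeout")
    else:
        print("Streaming tab appeared after about %.0f seconds" % waited)
```

If the reported wait grows with the amount of checkpoint data, that would be consistent with the explanation above; if the tab never appears even after batches have been running for a long time, the cause is probably something else.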