I am new to Spark Streaming and know little about checkpointing. Is the streaming data stored in the checkpoint? Is the data stored in HDFS or in memory? How much space will it take?
1 Answer
According to Spark: The Definitive Guide:
The most important operational concern for a streaming application is failure recovery. Faults are inevitable: you’re going to lose a machine in the cluster, a schema will change by accident without a proper migration, or you may even intentionally restart the cluster or application. In any of these cases, Structured Streaming allows you to recover an application by just restarting it. To do this, you must configure the application to use checkpointing and write-ahead logs, both of which are handled automatically by the engine. Specifically, you must configure a query to write to a checkpoint location on a reliable file system (e.g., HDFS, S3, or any compatible filesystem). Structured Streaming will then periodically save all relevant progress information (for instance, the range of offsets processed in a given trigger) as well as the current intermediate state values to the checkpoint location. In a failure scenario, you simply need to restart your application, making sure to point to the same checkpoint location, and it will automatically recover its state and start processing data where it left off. You do not have to manually manage this state on behalf of the application—Structured Streaming does it for you.
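To make the configuration step in that passage concrete, here is a minimal PySpark sketch. The function name `start_checkpointed_count`, the built-in `rate` test source, and the console sink are my own choices for illustration, not something the book prescribes; the only part that matters is the `checkpointLocation` option.

```python
def start_checkpointed_count(spark, checkpoint_dir):
    """Start a streaming aggregation whose progress and state are checkpointed.

    `spark` is an existing SparkSession; `checkpoint_dir` should be a path on
    a reliable file system (whatever path the caller passes is an assumption).
    """
    # Built-in test source that emits (timestamp, value) rows.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # A stateful aggregation: the running counts are the intermediate state.
    counts = events.groupBy((events.value % 10).alias("bucket")).count()

    return (
        counts.writeStream
        .outputMode("complete")
        .format("console")
        # Processed offsets and state snapshots are written under this path.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

You would call it with a session from `SparkSession.builder.getOrCreate()` and a directory such as `hdfs:///tmp/checkpoints/rate-count` (a made-up path). If the application is stopped and started again with the same directory, the query should pick up from the recorded offsets instead of starting over.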
I conclude that what is stored in the checkpoint is job progress information and intermediate results, not the streaming data itself. The checkpoint location has to be a path in an HDFS-compatible file system, and the required space depends on how much intermediate state the query generates.
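As a rough illustration of why the space depends on the state and not on the input volume, here is a toy model in plain Python. It is not how Spark lays out its checkpoint directory; the file format, the function names, and the word-count workload are all assumptions made for the sketch. It only saves the next offset to read and the running counts, then resumes from them.

```python
import json
import os
import tempfile


def process_batch(records, state):
    """Update running word counts (the intermediate state) from one batch."""
    for word in records:
        state[word] = state.get(word, 0) + 1
    return state


def write_checkpoint(path, next_offset, state):
    """Persist only progress (offset) and state, never the input records."""
    with open(path, "w") as f:
        json.dump({"next_offset": next_offset, "state": state}, f)


def run(source, checkpoint_path, batch_size=1000):
    """Resume from the checkpoint if present, then process the rest of source."""
    offset, state = 0, {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        offset, state = saved["next_offset"], saved["state"]
    while offset < len(source):
        batch = source[offset:offset + batch_size]
        state = process_batch(batch, state)
        offset += len(batch)
        write_checkpoint(checkpoint_path, offset, state)
    return state


source = ["a", "b", "c"] * 100_000  # 300,000 input records
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
counts = run(source, path)
checkpoint_bytes = os.path.getsize(path)  # three keys plus one offset
```

Here the checkpoint file stays well under a kilobyte even though 300,000 records went through, because it only holds three counters and an offset. A query with millions of distinct keys would need correspondingly more space.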