We have a Spark Streaming job reading from Kafka that creates checkpoints on our HDFS server, and they are never cleaned up. We now have millions of checkpoint files in HDFS. Is there a way to clean them up automatically from Spark?
Spark version 1.6, HDFS 2.7.0
val conf = new SparkConf().set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
Cleaning should not be done automatically for all checkpoints, because Spark Streaming saves intermediate state as checkpoint datasets and relies on them to recover from driver failures. They therefore need to be kept around across Spark invocations.
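To illustrate the distinction: the `spark.cleaner.referenceTracking.cleanCheckpoints` setting from the question asks Spark's context cleaner to delete checkpoint files for RDDs that have gone out of scope, but it does not touch the streaming metadata checkpoints written to the `StreamingContext` checkpoint directory, which streaming prunes on its own. A minimal sketch for Spark 1.6 (the app name and HDFS path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointCleanupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("checkpoint-cleanup-sketch") // placeholder name
      // Delete checkpoint files of RDDs that are no longer referenced.
      // This only covers RDD checkpoints created via rdd.checkpoint(),
      // not the streaming metadata checkpoints below.
      .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))

    // Streaming metadata checkpoints go here; Spark Streaming itself
    // keeps only the most recent ones and removes older generations.
    // The path is a placeholder for illustration.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

    // ... define the Kafka input stream and processing here ...

    ssc.start()
    ssc.awaitTermination()
  }
}
```

If the millions of files are old streaming checkpoint directories from previous application runs, they are safe to remove manually (e.g. with `hdfs dfs -rm -r`) once you are sure no running or restartable job needs them for recovery.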