3
votes

We have a Spark Streaming job consuming from Kafka that creates checkpoints on our HDFS cluster, and they are not getting cleaned up. We now have millions of checkpoint files in HDFS. Is there a way to clean them up automatically from Spark?

Spark version 1.6, HDFS 2.7.0

There are also other random directories, besides the checkpoints, that are not being cleared.


1 Answer

4
votes
val conf = new SparkConf().set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")

Cleaning is not done automatically for all checkpoints by default, because it can be necessary to keep them around across Spark invocations: Spark Streaming saves intermediate state datasets as checkpoints and relies on them to recover from driver failures.
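For context, here is a minimal sketch of how that property fits into a Spark 1.6 streaming setup. The application name, batch interval, and HDFS path are placeholders, not taken from the question; the flag tells Spark's ContextCleaner to delete checkpoint files for RDDs once they go out of scope, while the streaming metadata checkpoints in the directory passed to `ssc.checkpoint` are still managed by Spark itself.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("checkpoint-cleanup-example") // placeholder app name
  // Ask the ContextCleaner to delete checkpoint files once the
  // corresponding RDDs are garbage-collected on the driver.
  .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")

val ssc = new StreamingContext(conf, Seconds(10)) // placeholder batch interval
// Streaming metadata checkpoints are written here; Spark retains only
// the most recent ones on its own. The path is a placeholder.
ssc.checkpoint("hdfs:///tmp/spark-checkpoints")
```

Note that this flag only covers RDD checkpoint files tracked by the cleaner; any unrelated stray directories in HDFS would still need a separate cleanup job.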