
We are using Spark + Java in our project, and the Hadoop distribution being used is MapR.

In our Spark jobs we persist data (at disk level).

After the job completes, there is a lot of temporary data left inside the /tmp/ folder. How can we ensure that the /tmp/ folder (temp data) gets emptied after job execution completes?

I found this related question: Apache Spark does not delete temporary directories

But I am not sure how to set the following properties:

  • spark.worker.cleanup.enabled

  • spark.worker.cleanup.interval

  • spark.worker.cleanup.appDataTtl

Also, where should these properties be set: 1. in code, or 2. in the Spark configuration?

We are running the job in cluster mode (with master yarn), using the spark-submit command.
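For reference, per-application properties can be passed either with --conf on spark-submit or via SparkConf in code. A minimal sketch of our submit command is below (the class name, jar, and property shown are placeholders, not our real job):

```bash
# Placeholder class, jar, and property names, for illustration only
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkJob \
  --conf spark.some.property=value \
  my-spark-job.jar
```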

Thanks, Anuj

1 Answer

  1. Create a backup of the spark-env.sh file. Open the file in a text editor (e.g. vi) and locate the "SPARK_WORKER_OPTS" line.

  2. Immediately below this line, add or update: SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=172800" (see the sketch after this list).

  3. This enables cleanup of the worker's application work directories, retains that data (including logs) for no longer than 48 hours (172800 seconds), and checks at the default interval of every 30 minutes (spark.worker.cleanup.interval, 1800 seconds).
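Putting the steps together, here is a sketch of the spark-env.sh entry with all three properties (the 1800-second interval is just the default made explicit, and the file path is assumed; adjust it for your MapR install). These spark.worker.cleanup.* settings are read by the standalone worker daemon, which is why they belong in spark-env.sh on the worker nodes rather than in the application's SparkConf:

```bash
# $SPARK_HOME/conf/spark-env.sh (path assumed; adjust for your MapR install)
# Periodically clean up finished applications' work directories on standalone workers
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
 -Dspark.worker.cleanup.interval=1800 \
 -Dspark.worker.cleanup.appDataTtl=172800"
```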

Restart the Spark workers and you're done!