10 votes

We have a Spark Streaming application, which is a long-running task. The event log points to the HDFS location hdfs://spark-history; an application_XXX.inprogress file is created there when we start the streaming application, and the file grows to around 70GB. To delete the log file we currently stop the Spark Streaming application and clear it manually. Is there any way to automate this process without stopping or restarting the application? We have configured spark.history.fs.cleaner.enabled=true with a cleaning interval of 1 day and a max age of 2 days, but it does not clean the .inprogress file. We are using Spark 1.6.2, running on YARN in cluster mode.
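For reference, the cleaner settings described above correspond roughly to the following entries in spark-defaults.conf (key names are the standard Spark history server settings; values are as stated in the question):

# event log location and history server cleaner, as described in the question
spark.eventLog.enabled               true
spark.eventLog.dir                   hdfs://spark-history
spark.history.fs.cleaner.enabled     true
spark.history.fs.cleaner.interval    1d
spark.history.fs.cleaner.maxAge      2d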

What is the retention policy for HDFS log storage? – FaigB

1 Answer

3 votes

To address this you need to change a few configurations. First, in your yarn-default.xml file you need to change or add this row:

yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds=3600

This modification makes YARN aggregate your log files periodically (every hour with the value above) instead of only when the application finishes, which allows you to see the data via yarn logs -applicationId YOUR_APP_ID while the application is still running.

This is the first step. You can see a little about this here.
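In XML form, the setting would look roughly like this (in practice it is usually overridden in yarn-site.xml rather than edited directly in yarn-default.xml):

<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>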

Second step: you need to create a log4j-driver.properties file and a log4j-executor.properties file.

In these files you can use this example:

# Route all INFO-and-above messages to the "rolling" appender
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
# Roll the file when it reaches maxFileSize, keeping maxBackupIndex old files
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
# File name and application log level are parameterised via the
# dm.logging.name / dm.logging.level system properties (see spark-submit below)
log4j.appender.rolling.file=/var/log/spark/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8
log4j.logger.org.apache.spark=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.com.anjuke.dm=${dm.logging.level}

What do these rows mean?

The line log4j.appender.rolling.maxFileSize=50MB caps each log file at 50MB. When a log file reaches 50MB it is closed and a new one is started.

The other relevant row is log4j.appender.rolling.maxBackupIndex=5: it keeps a backup history of 5 files of 50MB each, so at most about 300MB of logs (the active file plus 5 backups) are retained. Over time the oldest backup is deleted as new files are rolled.
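With these settings, the Spark log directory on a node would end up looking something like this (assuming -Ddm.logging.name=myapp as in the command below; the listing is illustrative):

/var/log/spark/myapp.log      <- current file being written
/var/log/spark/myapp.log.1    <- most recent rolled backup
/var/log/spark/myapp.log.2
/var/log/spark/myapp.log.3
/var/log/spark/myapp.log.4
/var/log/spark/myapp.log.5    <- oldest backup, deleted on the next roll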

After you create these log files, you need to pass them via the spark-submit command:

spark-submit \
  --master spark://127.0.0.1:7077 \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j-driver.properties -Ddm.logging.level=DEBUG" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j-executor.properties -Ddm.logging.name=myapp -Ddm.logging.level=DEBUG" \
  ...

You can create one log file for your driver and one for your executors. In the command above I'm using two different properties files, but you can use the same one for both. For more details you can see here.
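Since the question runs on YARN in cluster mode, the driver does not run on the submitting machine, so a local file:/path/... reference will not resolve there. A common pattern (a sketch, not part of the original answer; the dm.logging.* values are illustrative) is to ship the properties files with --files and reference them by bare file name, which is then found in each container's working directory:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /path/to/log4j-driver.properties,/path/to/log4j-executor.properties \
  --driver-java-options "-Dlog4j.configuration=log4j-driver.properties -Ddm.logging.name=myapp-driver -Ddm.logging.level=DEBUG" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties -Ddm.logging.name=myapp-executor -Ddm.logging.level=DEBUG" \
  ...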