2 votes

I have referred to the links below and made the same changes:

  1. https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/
  2. Cleaning up Spark history logs

I then restarted the history-server and resource-manager, but the container logs are still not being deleted after the defined time, which is causing an unhealthy-node issue.

My configuration is as follows:

  1. /etc/hadoop/conf/yarn-site.xml (see also the note after this list)

    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>600</value>
    </property>
  2. /etc/logpusher/hadoop.config
"/var/log/hadoop-yarn/containers" : {
                "includes" : [ "(.*)" ],
                "s3Path" : "containers/$0",
                "retentionPeriod" : "1h",
                "deleteEmptyDirectories": true,
                "logType" : [ "USER_LOG", "SYSTEM_LOG" ]
}

  3. /etc/spark/spark-defaults.conf

spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.maxAge  1h
spark.history.fs.cleaner.interval 1h
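
Note (a general YARN behavior, not something from the links above): yarn.log-aggregation.retain-seconds only takes effect when log aggregation is enabled; with aggregation off, yarn.nodemanager.log.retain-seconds controls how long local container logs are kept on the nodes. A minimal yarn-site.xml sketch showing both, with illustrative values:

    <!-- retain-seconds above only applies when aggregation is enabled -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- with aggregation disabled, local container logs are kept this long instead -->
    <property>
        <name>yarn.nodemanager.log.retain-seconds</name>
        <value>3600</value>
    </property>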

Could you please suggest what I am missing?

1 – It might be the event log (spark.eventLog.enabled); see issues.apache.org/jira/browse/SPARK-5210 - ollik1
@Manoj Kumar Dhakad Did you resolve this issue? I am also facing the same issue with cleaning up container logs for streaming jobs. - RMu
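
Following up on ollik1's comment: if the Spark event log is what fills the disk for long-running streaming jobs, a quick way to rule it out is to turn event logging off in spark-defaults.conf. This is a sketch of one diagnostic step, not a full fix; disabling it also removes those applications from the history server UI:

    # /etc/spark/spark-defaults.conf
    spark.eventLog.enabled false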

1 Answer

0 votes

Make sure you follow the steps below:

  1. Turn on termination protection on the EMR cluster to avoid data loss due to termination; I assume you might have done this already.

  2. The solution to avoid this is to push the Spark logs to S3.

For streaming jobs, this is handled by the "log4j.logger.org.apache.spark.streaming=INFO,DRFA-stderr,DRFA-stdout" property in the file "/etc/spark/conf/log4j.properties".

However, this setting currently only works for Java-based Spark streaming apps. For Python Spark streaming jobs, you can replace the /etc/spark/conf/log4j.properties file with the configuration below.

============================================================================

log4j.rootLogger=INFO,file
log4j.appender.file.encoding=UTF-8
log4j.appender.file=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.file.RollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.file.RollingPolicy.FileNamePattern=${spark.yarn.app.container.log.dir}/spark-%d{yyyy-MM-dd-HH-mm-ss}.log
log4j.appender.file.TriggeringPolicy=org.apache.log4j.rolling.SizeBasedTriggeringPolicy
log4j.appender.file.TriggeringPolicy.maxFileSize=100000
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.appender.file.Append=true
============================================================================

This configuration uses the RollingFileAppender class to rotate container log files when they exceed 100,000 bytes (the size is configurable). Each rotated file is named with a timestamp to prevent duplicate files from being uploaded to S3. Once you have updated the file, you need to restart the spark-history-server on each node in your EMR cluster.
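
For reference, the restart commands differ by release; a hedged sketch, assuming an EMR release from 5.30.0 onward where services are managed by systemd (older releases use upstart's stop/start instead):

    # EMR 5.30.0 and later (systemd), run on each node
    sudo systemctl restart spark-history-server

    # Older EMR releases (upstart)
    sudo stop spark-history-server
    sudo start spark-history-server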

Refer to https://aws.amazon.com/premiumsupport/knowledge-center/emr-cluster-disk-space-spark-job/ for EMR versions earlier than 5.18.0.