
TL;DR: is it possible to suppress individual Spark logging messages without clobbering all logging?

I'm running a Spark Streaming job on EMR, and getting logging messages like:

17/08/17 21:09:00 INFO TaskSetManager: Finished task 101.0 in stage 5259.0 (TID 315581) in 17 ms on ip-172-31-37-216.ec2.internal (107/120)
17/08/17 21:09:00 INFO MapPartitionsRDD: Removing RDD 31559 from persistence list
17/08/17 21:09:00 INFO DAGScheduler: Job 2629 finished: foreachPartition at StreamingSparkJob.scala:52, took 0.080085 s
17/08/17 21:09:00 INFO DAGScheduler: ResultStage 5259 (foreachPartition at StreamingSparkJob.scala:52) finished in 0.077 s
17/08/17 21:09:00 INFO JobScheduler: Total delay: 0.178 s for time 1503004140000 ms (execution: 0.084 s)

None of this is helpful at this stage of development, and it masks the real logging that my application emits deliberately. I would like to stop Spark from emitting these log messages, or at least suppress their recording.

AWS Customer Support, and various answers (e.g.), suggest that this can be achieved by passing some JSON configuration at cluster creation. However, since this is a streaming job (for which the cluster would, ideally, stay up forever and simply be redeployed to), I'd like to find a way to configure this via spark-submit options.
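For reference, the cluster-creation route looks roughly like the following (a sketch of the EMR "Configurations" JSON; I'm assuming the spark-log4j classification is the one AWS support has in mind, and since it only takes effect at cluster creation it doesn't suit a long-lived streaming cluster):

[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.logger.org.apache.spark": "WARN"
    }
  }
]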

Other responses (e.g., e.g.) suggest that this can be done by submitting a log4j.properties file that sets log4j.rootCategory=WARN, <appender>. However, this link suggests that rootCategory is the same thing as rootLogger, so I interpret that as limiting all logging (not just Spark's) to WARN - and indeed, when I deployed a change doing exactly that, that is what I observed.
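For context, the mechanism those responses describe (a sketch; the file name and path are my own, and in client mode the driver may need a file: URL to a local copy instead) is to ship a custom log4j.properties alongside the job and point both JVMs at it:

spark-submit \
  --files ./log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  <rest of the job arguments>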

I note that the final paragraph of here says "Spark uses log4j for logging. You can configure it by adding a log4j.properties file in the conf directory. One way to start is to copy the existing log4j.properties.template located there." I'm about to experiment with this to see whether it suppresses the INFO logs that are flooding our output. However, this still isn't an ideal solution, because some of the INFO logs that Spark emits are useful - for instance, when it records the number of files picked up (from S3) by each streaming iteration. So, what I'd ideally like is one of:

  • Configuration flags that can be switched to disable specific classes of Spark's logging messages, without suppressing all INFO logs (see the sketch after this list)
  • A "suppress all logging that matches this regex" option, which we could update as needed to filter out the messages we're not interested in

Do either of these exist?

(To pre-empt one possible response: I'm loath to restrict my own application's logging to WARN and above.)


1 Answer


You can control logging per logger namespace from log4j.properties; here is an example:

log4j.rootLogger=WARN, console
# add a ConsoleAppender named "console" that writes to stdout
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# use a simple message format
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
# set the log level for these components
log4j.logger.org.apache.spark=WARN
log4j.logger.org.spark-project=ERROR
log4j.logger.org.apache.hadoop=ERROR
log4j.logger.io.netty=ERROR
log4j.logger.org.apache.zookeeper=ERROR
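If you still want some of Spark's INFO output (for instance the per-batch file listing mentioned in the question), a more specific logger setting overrides the package-level one, so you can re-enable just that logger - for example (the class name here is my guess at the one emitting those messages):

log4j.logger.org.apache.spark.streaming.dstream.FileInputDStream=INFO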