TL;DR is it possible to suppress individual Spark logging messages without clobbering all logging?
I'm running a Spark Streaming job on EMR, and getting logging messages like:
17/08/17 21:09:00 INFO TaskSetManager: Finished task 101.0 in stage 5259.0 (TID 315581) in 17 ms on ip-172-31-37-216.ec2.internal (107/120)
17/08/17 21:09:00 INFO MapPartitionsRDD: Removing RDD 31559 from persistence list
17/08/17 21:09:00 INFO DAGScheduler: Job 2629 finished: foreachPartition at StreamingSparkJob.scala:52, took 0.080085 s
17/08/17 21:09:00 INFO DAGScheduler: ResultStage 5259 (foreachPartition at StreamingSparkJob.scala:52) finished in 0.077 s
17/08/17 21:09:00 INFO JobScheduler: Total delay: 0.178 s for time 1503004140000 ms (execution: 0.084 s)
None of this is helpful at this stage of development, and it masks the real logging that my application emits deliberately. I would like to stop Spark from emitting these log messages, or to suppress their recording.
AWS Customer Support, and various answers (e.g.), suggest that this can be achieved by passing some JSON configuration on cluster creation. However, since this is a streaming job (for which the cluster would, ideally, remain up forever and just get redeployed to), I'd like to find some way to configure this via spark-submit options instead.
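For concreteness, the sort of invocation I have in mind would ship a custom log4j configuration alongside the job and point both the driver and the executors at it - something like the sketch below (untested; the class and jar names are placeholders for my actual job):

# Ship a custom log4j.properties with the job and tell both JVMs to use it
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --class com.example.StreamingSparkJob \
  streaming-spark-job.jar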
Other responses (e.g., e.g.) suggest that this can be done by submitting a log4j.properties file which sets log4j.rootCategory=WARN, <appender>. However, this link suggests that rootCategory is the same thing as rootLogger, so I would interpret this as limiting all logging (not just Spark's) to WARN - and, indeed, when I deployed a change doing this, that was what was observed.
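(What I'm after is per-logger levels rather than a change to the root logger - i.e. a log4j.properties roughly like the sketch below. The logger names are guesses based on the class names in the output above, and the appender definition is copied from the stock template; I haven't confirmed this is the right granularity.)

# Keep the root logger at INFO so the application's own logging is unaffected
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Silence only the chatty Spark internals seen in the output above
log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=WARN
log4j.logger.org.apache.spark.rdd.MapPartitionsRDD=WARN
log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler=WARN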
I note that the final paragraph of here says: "Spark uses log4j for logging. You can configure it by adding a log4j.properties file in the conf directory. One way to start is to copy the existing log4j.properties.template located there." I'm about to experiment with this to see whether it will suppress the INFO logs that fill up our logging. However, this is still not an ideal solution, because some of the INFO logs that Spark emits are useful - for instance, when it records the number of files that were picked up (from S3) by each streaming iteration. So, what I'd ideally like would be one of:
- Configuration flags that can be switched to disable specific classes of Spark's logging messages, without suppressing all INFO logs
- A "suppress all logging that matches this regex" option, which we can update as appropriate to filter out the messages that we're not interested in
Does either of these exist?
(To address a possible response - I'm loath to only emit logs from my own application at WARN and above.)
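(For what it's worth, I'm aware I could probably set levels for individual loggers programmatically from the driver, along the lines of the sketch below - logger names again taken from the output above - but that only takes effect in the driver JVM and bakes logging policy into application code, so I'd much rather have a configuration-based switch.)

import org.apache.log4j.{Level, Logger}

object StreamingSparkJob {
  def main(args: Array[String]): Unit = {
    // Raise the threshold for specific chatty Spark internals, leaving the
    // root logger (and hence this application's own INFO logging) untouched.
    Seq(
      "org.apache.spark.scheduler.TaskSetManager",
      "org.apache.spark.scheduler.DAGScheduler",
      "org.apache.spark.streaming.scheduler.JobScheduler"
    ).foreach(name => Logger.getLogger(name).setLevel(Level.WARN))

    // ... build the StreamingContext and run the job as before ...
  }
}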