5
votes

As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is to change the corresponding log4j.properties. However, on Dataproc Spark runs on YARN, so we have to adjust the global configuration and not /usr/lib/spark/conf.

Several suggestions:

On Dataproc, several gcloud commands and properties can be passed during cluster creation (see the documentation). Is it possible to change the log4j.properties under /etc/hadoop/conf by specifying:

--properties 'log4j:hadoop.root.logger=WARN,console'

Maybe not, since the docs say:

The --properties command cannot modify configuration files not shown above.

Another way would be to use a shell script during cluster init and run sed:

# change log level for each node to WARN
sudo sed -i -- 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' \
    /etc/spark/conf/log4j.properties
sudo sed -i -- 's/hadoop.root.logger=INFO,console/hadoop.root.logger=WARN,console/g' \
    /etc/hadoop/conf/log4j.properties

But is it enough or do we need to change the env variable hadoop.root.logger as well?
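The two sed substitutions can be tried against sample lines before they go into an init action. This is only a sketch: the sample lines are assumed stand-ins for what the real log4j.properties files contain, and the dots are escaped here so they match literally. Without -i, sed prints the result instead of editing a file, so nothing on the node is touched.

```shell
#!/bin/sh
# Sample lines standing in for the real config files (assumed contents).
spark_line='log4j.rootCategory=INFO, console'
hadoop_line='hadoop.root.logger=INFO,console'

# Same substitutions as the init script, run on stdin rather than in place.
spark_result="$(printf '%s\n' "$spark_line" |
    sed 's/log4j\.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g')"
hadoop_result="$(printf '%s\n' "$hadoop_line" |
    sed 's/hadoop\.root\.logger=INFO,console/hadoop.root.logger=WARN,console/g')"

echo "$spark_result"
echo "$hadoop_result"
```

If either line comes back unchanged, the pattern no longer matches the file's contents, which is the failure mode to watch for when the image version changes.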

2
The 2nd way actually works for me, but I still wonder if there's a better way without editing the config files, which might change over time and releases. - Frank

2 Answers

4
votes

At the moment, you're right that --properties doesn't support extra log4j settings, but it's certainly something we've talked about adding. One consideration is how to balance fine-grained control over the logging configs of Spark, YARN, and other long-running daemons (hiveserver2, HDFS daemons, etc.) against keeping a minimal, simple setting that is plumbed through to everything in a shared way.

At least for Spark driver logs, you can use the --driver-log-levels setting at job-submission time, which should take precedence over any of the /etc/*/conf settings. Otherwise, as you describe, init actions are a reasonable way to edit the files on cluster startup for now, keeping in mind that they may change over time and releases.
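As a rough illustration of the job-submission route, a submit command with --driver-log-levels might look like the sketch below. The cluster name, region, main class, and jar path are hypothetical placeholders, and the command is only printed (a dry run) so it can be reviewed without a cluster.

```shell
#!/bin/sh
# Hypothetical placeholders; substitute your own cluster, region, class, and jar.
cluster="my-cluster"
region="us-central1"
driver_log_levels="root=WARN,org.example=INFO"

# Dry run: echo prints the command instead of executing it.
# To submit for real, run the same gcloud command without the echo.
preview="$(echo gcloud dataproc jobs submit spark \
    --cluster="$cluster" \
    --region="$region" \
    --class=org.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --driver-log-levels="$driver_log_levels")"
echo "$preview"
```

The value is a comma-separated list of package=level pairs, with root standing for the root logger. This affects the driver only, so YARN and executor logging still follow the files under /etc/*/conf.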

0
votes

Support for log4j properties has recently been added via the --properties flag. For example, you can now use --properties 'hadoop-log4j:hadoop.root.logger=WARN,console'. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties for more details.
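Put together with cluster creation, that could look roughly like the sketch below. The cluster name and region are hypothetical, the property string is the one quoted above, and the command is only printed rather than run. One assumption worth checking: the value itself contains a comma, so gcloud may need an alternate list delimiter for --properties (see `gcloud topic escaping`).

```shell
#!/bin/sh
# Hypothetical placeholders; substitute your own cluster name and region.
cluster="my-cluster"
region="us-central1"

# Property string taken from the answer above: file prefix, colon, key=value.
log_property='hadoop-log4j:hadoop.root.logger=WARN,console'

# Dry run: echo prints the command instead of creating a cluster.
create_preview="$(echo gcloud dataproc clusters create "$cluster" \
    --region="$region" \
    --properties="$log_property")"
echo "$create_preview"
```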