3
votes

I'm a beginner with Spark, Hadoop and YARN. I installed Spark following https://spark.apache.org/docs/2.3.0/ and Hadoop/YARN following https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html. My aim is to run a Spark application on a YARN cluster, but I'm having problems. How do I know when my setup works? Here is my example. After finishing the setup, I tried to run the example jar examples/jars/spark-examples*.jar. When I run Spark locally with ./bin/spark-submit --class org.apache.spark.examples.SparkPi, at some point I see the line "Pi is roughly 3.1370956854784273". But when I run on the YARN cluster with ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples*.jar, "Pi is roughly 3.1370956854784273" never appears in the console, and I don't know where to find it. I checked the logs at http://localhost:8088/cluster/cluster, but it doesn't appear there either. Do you know where I should look? Thanks for your help and have a nice day.

5
Hello everyone — of course, I forgot it! ;) – THIBAULT Nicolas
To be precise, I saw that someone had a similar problem on this site, but I didn't understand the answer. – THIBAULT Nicolas

5 Answers

1
votes

You can view the logs through the ResourceManager UI using the application ID, or get the entire log for the application from the command line with:

yarn logs -applicationId <application ID>

1
votes

In YARN cluster mode, the console you submit from is not really your driver's output (the driver runs on the cluster), so the output ends up in the YARN logs themselves. You can run

yarn logs -applicationId application_1549879021111_0007 >application_1549879021111_0007.log

and after

more application_1549879021111_0007.log

Then you can type /pattern, where pattern is a word or expression from a print statement inside your Python script. Usually, I use

print('####' + expression_to_print + '####')

so that afterwards I can type /#### to jump straight to my print output.
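As a minimal sketch of that marker trick (the marked helper name is my own, not from the answer):

```python
# Hypothetical helper that wraps a value in marker strings so the line is
# easy to find later with /#### in `more` (or with grep) in the YARN logs.
def marked(value):
    return '####' + str(value) + '####'

print(marked(3.1370956854784273))  # -> ####3.1370956854784273####
```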

0
votes

You need to find the Spark driver container in YARN, or from the Spark UI. From there, you can go to the Executors tab, and you will see the stdout and stderr links for each one (plus, the Driver, where the final output will be).

Over time, YARN will evict these logs, which is why you need log aggregation enabled and the Spark History Server deployed.
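For reference, a minimal sketch of the Spark-side settings that feed the History Server, in spark-defaults.conf (the hdfs:///spark-logs path is an assumption — point it at a directory that exists on your cluster):

```
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```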


FWIW, Cloudera is going all-in on running Spark on Kubernetes in its recent announcements. I'm not sure what that says for the future of YARN (or of HDFS, with Ceph and S3 becoming popular datastores for these deployments).

0
votes

I ran into the same issue and was finally able to see the "Pi is roughly 3.14..." line after the following steps:

First, enable YARN log aggregation on every node by adding these lines to yarn-site.xml:

<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
    <value>3600</value>
</property>

You may need to restart YARN and DFS after modifying yarn-site.xml.

Then check the logs from the command line:

yarn logs -applicationId <applicationID>

and search the output for the result, e.g.:

yarn logs -applicationId <applicationID> | grep "Pi is roughly"
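To illustrate the grep step on its own (using a fake log file rather than a real YARN log):

```shell
# Simulate a saved application log and search it for the result line
printf 'INFO SparkContext: ...\nPi is roughly 3.1370956854784273\n' > app.log
grep 'Pi is roughly' app.log   # prints: Pi is roughly 3.1370956854784273
```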

Hope it helps.

-1
votes

You will have to redirect the console output to a file. The command below writes the output of your Spark program into a file; you can then run tail -n 100 -f on the consoleoutfile.txt mentioned below to follow your console output.

./submit_command > local_fs_path/consoleoutfile.txt 2>&1
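As a self-contained sketch of the redirect-and-tail idea (the echo stands in for the real spark-submit command, and the file name is just a placeholder):

```shell
# Stand-in for the real submit command, just to show the redirect pattern
echo 'Pi is roughly 3.1370956854784273' > consoleoutfile.txt 2>&1
# Show the last 100 lines of the captured console output
tail -n 100 consoleoutfile.txt
```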