
I am trying to understand how a Spark job works in a YARN cluster.

I am using the below command to submit the job:

  1. spark-submit --master yarn --deploy-mode cluster sparksessionexample.py
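For reference, a minimal PySpark script of this kind typically looks like the following sketch (this is only an illustration, not necessarily the exact contents of sparksessionexample.py):

    # minimal PySpark job sketch -- illustrative, not necessarily the exact script
    from pyspark.sql import SparkSession

    # In cluster mode, --master and --deploy-mode come from spark-submit,
    # so the script only needs to create (or get) a SparkSession.
    spark = SparkSession.builder.appName("sparksessionexample").getOrCreate()

    # Some trivial work so the executors have something to do.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()

    spark.stop()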

After submitting the job, the console shows the log below:

2020-05-29 20:52:48,668 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_libs__2908230569257238890.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_libs__2908230569257238890.zip
2020-05-29 20:53:14,164 INFO yarn.Client: Uploading resource file:/home/hadoop/pythonprojects/Python/src/spark_jobs/sparksessionexample.py -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/sparksessionexample.py
2020-05-29 20:53:14,610 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/pyspark.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/pyspark.zip
2020-05-29 20:53:15,984 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/py4j-0.10.7-src.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/py4j-0.10.7-src.zip
2020-05-29 20:53:18,362 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_conf__7123551182035223076.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_conf__.zip

I just want to understand how YARN executes the sparksessionexample.py file. I mean, does it create a Python virtual environment on the node? The log above only shows the libraries and configs being uploaded, but what about the Python interpreter that actually executes sparksessionexample.py?

Can anyone help me understand this?


1 Answer


The "Spark client" is used to bootstrap the Spark job execution.

In your case it is the only thing that runs on your local machine, because you requested cluster deploy mode. The sequence is roughly as follows (a spark-submit comparison of the two deploy modes is sketched right after the list):

  • the "client" contacts the cluster manager (here YARN Resource Manager, could be Kubernetes Master, etc.) to start the Spark driver inside an AppMaster container
  • then the driver contacts again the cluster manager to request some containers for the executors
  • then the driver runs your Python code and distributes the work to the executors
  • finally the driver de-allocates its executors and itself
  • at this point the "client" notices that the YARN job has reached success or failure status, and can terminate

In short, the "client" never gets any kind of useful information from the driver running inside the cluster. You must inspect the YARN logs for the container running the driver (it's the AppMaster, typically number 00001).


In client execution mode, by contrast, the driver runs inside the spark-submit process on your local machine, so the job's output and driver logs appear directly in your console.