While analyzing the YARN launch_container.sh log for a Spark job, I got confused by some parts of the log. I will lay out those questions step by step here.
When you submit a Spark job with spark-submit, passing --py-files and --files, in cluster mode on YARN:
1. The config files passed with --files and the executable Python files passed with --py-files get uploaded into the .sparkStaging directory created under the user's HDFS home directory. Along with these files, pyspark.zip and py4j-<version>.zip from $SPARK_HOME/python/lib are also copied into that same .sparkStaging directory.
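For context, my submit command looks roughly like this (the file names and paths below are placeholders, not my real ones):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --files /home/me/conf/app.conf \
      --py-files /home/me/libs/helpers.py \
      /home/me/jobs/main_job.py

While the job is running, the staged files are visible with something like this, where <user> and <application_id> are placeholders:

    hdfs dfs -ls /user/<user>/.sparkStaging/<application_id>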
2. After this, launch_container.sh is triggered by YARN, and it exports all the required environment variables. If we have explicitly exported anything, such as PYSPARK_PYTHON, in .bash_profile, in the shell script that builds the spark-submit command, or in spark-env.sh, the default value is replaced by the value we provide.
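In my case I export it in the shell script that wraps the submit, along these lines (the 3.5.x path is just an example of a path on my edge node):

    # export PYSPARK_PYTHON before submitting; the path is a placeholder
    export PYSPARK_PYTHON=/usr/local/python3.5/bin/python3
    spark-submit --master yarn --deploy-mode cluster ... main_job.py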
This PYSPARK_PYTHON is a path on my edge node. How, then, is a container launched on another node able to use this Python version? The default Python version on the data nodes of my cluster is 2.7.5, so without setting PYSPARK_PYTHON the containers use 2.7.5; but when I set PYSPARK_PYTHON to 3.5.x, they use what I have given. The script is also defining PWD='/data/complete-path'.
Where does this PWD directory reside? It is cleaned up after job completion. I even ran the job in one PuTTY session while keeping the /data folder open in another PuTTY session to see whether any directories were being created at run time, but I couldn't find any. The script is also setting PYTHONPATH to $PWD/pyspark.zip:$PWD/py4j-<version>.zip.
Whenever I do a Python-specific operation in the Spark code, it uses PYSPARK_PYTHON. So for what purpose is this PYTHONPATH being used?
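For reference, the relevant export lines in my launch_container.sh look roughly like this (paths anonymized, so treat them as placeholders):

    # excerpt-style sketch of the exports, with paths anonymized
    export PWD="/data/<complete-path>"
    export PYTHONPATH="$PWD/pyspark.zip:$PWD/py4j-<version>.zip"
    export PYSPARK_PYTHON="/usr/local/python3.5/bin/python3"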
3. After this, YARN creates soft links using ln -sf for all the files from step 1: pyspark.zip, py4j-<version>.zip, and all the Python files mentioned in step 1. These links point back to a '/data/different_directories' directory (which, again, I am not sure where it resides). I know soft links can be used for accessing remote nodes, but why are the soft links created here? A sketch of these lines is below.
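These are roughly the lines I mean (anonymized; app.conf and helpers.py stand in for the real names of my --files and --py-files from the example above):

    # sketch of the ln -sf lines from launch_container.sh, anonymized;
    # targets sit under a /data/... directory I can't locate on my edge node
    ln -sf "/data/<different_directory>/pyspark.zip" "pyspark.zip"
    ln -sf "/data/<different_directory>/py4j-<version>.zip" "py4j-<version>.zip"
    ln -sf "/data/<different_directory>/app.conf" "app.conf"
    ln -sf "/data/<different_directory>/helpers.py" "helpers.py"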
Last but not least, does this launch_container.sh run for every container launch?