According to the official Spark documentation (http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup), when using the "spark.dynamicAllocation" option with YARN, you need to:
- In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services ...
- Set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService
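For reference, a yarn-site.xml following those documented steps would look roughly like this (a sketch assembled from the steps above; the mapreduce_shuffle entry is kept alongside spark_shuffle, which I assume is the intended combination):

```xml
<!-- Sketch of yarn-site.xml per the Spark docs quoted above -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```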
The AWS EMR documentation, in turn, says that
"...Spark Shuffle Service is automatically configured by EMR." (http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html)
However, I've noticed that "yarn.nodemanager.aux-services" in "yarn-site" on the EMR nodes is set to:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,</value>
</property>
and no "yarn.nodemanager.aux-services.spark_shuffle.class" property is present at all.
I'm a bit new to the Spark/Hadoop ecosystem, so this raised a few questions in my head:
- Why am I still able to successfully run Spark jobs with "spark.dynamicAllocation" set to "true" while the basic configuration requirements are not met? Does this mean that Spark could somehow be using "mapreduce_shuffle" as a fallback?
- If the assumption above (Spark falls back to "mapreduce_shuffle") is true, are there possible performance (or other) implications of using an improper shuffle class ("mapreduce_shuffle" maps to the "org.apache.hadoop.mapred.ShuffleHandler" class)?
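For completeness, the client-side settings that the Spark documentation pairs with dynamic allocation look like this (a spark-defaults.conf sketch; both property names are from the Spark docs, and spark.shuffle.service.enabled is the one the external shuffle service requirement hinges on):

```
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
```

This is what makes the behavior puzzling to me: with these set on the client side, the job runs even though the node-side spark_shuffle service appears to be missing.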
Note: I'm using EMR AMI version 4.6.0.