I'm trying to maximize cluster usage for a simple task.
The cluster is 1 master + 2 core m3.xlarge nodes, running Spark 1.3.1, Hadoop 2.4, Amazon AMI 3.7.
The task reads all lines of a text file and parses them as CSV.
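For context, the per-record work is trivial. The real job is a Scala class (my.sample.spark.Sample, not shown here), but a rough Python sketch of what each task does per line would be:

```python
import csv

def parse_line(line):
    # Parse a single text line as one CSV record; csv.reader accepts any
    # iterable of strings, so a one-element list works for a single line.
    return next(csv.reader([line]))

print(parse_line('a,"b,c",d'))  # ['a', 'b,c', 'd']
```

The point is only that the workload is embarrassingly parallel, so it should spread across both core nodes.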
When I spark-submit the task in yarn-cluster mode, I get one of the following results:
- 0 executors: the job waits indefinitely until I kill it manually
- 1 executor: the job under-utilizes resources, with only one machine working
- OOM, when I do not assign enough memory to the driver
What I would have expected:
- The Spark driver running on the cluster master with all of its memory available, plus 2 executors of 9404MB each (as defined by the install-spark script).
Sometimes, when I get a "successful" execution with 1 executor, cloning and re-running the step ends up with 0 executors.
I created my cluster using this command:
aws emr --region us-east-1 create-cluster --name "Spark Test" \
--ec2-attributes KeyName=mykey \
--ami-version 3.7.0 \
--use-default-roles \
--instance-type m3.xlarge \
--instance-count 3 \
--log-uri s3://mybucket/logs/ \
--bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-x"] \
--steps Name=Sample,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--master,yarn,--deploy-mode,cluster,--class,my.sample.spark.Sample,s3://mybucket/test/sample_2.10-1.0.0-SNAPSHOT-shaded.jar,s3://mybucket/data/],ActionOnFailure=CONTINUE
With some step variations, including:
--driver-memory 8G --driver-cores 4 --num-executors 2
The install-spark script with -x produces the following spark-defaults.conf:
$ cat spark-defaults.conf
spark.eventLog.enabled false
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO
spark.executor.instances 2
spark.executor.cores 4
spark.executor.memory 9404M
spark.default.parallelism 8
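A back-of-envelope sizing calculation may explain the symptoms. Note the assumptions: the per-node YARN capacity below is my guess for m3.xlarge on AMI 3.x (check yarn.nodemanager.resource.memory-mb in yarn-site.xml for the real value), and the overhead formula is the Spark 1.3 default as I understand it (max of 384MB and 7% of the heap):

```python
# Sketch: do an executor container and the 8G driver container fit on
# one core node together?
NODE_MB = 11520        # ASSUMED yarn.nodemanager.resource.memory-mb for m3.xlarge
EXECUTOR_MB = 9404     # spark.executor.memory from spark-defaults.conf
DRIVER_MB = 8 * 1024   # --driver-memory 8G

def container_mb(heap_mb, overhead_fraction=0.07, min_overhead_mb=384):
    """YARN container request = heap + overhead (assumed Spark 1.3 default:
    max(384MB, 7% of heap))."""
    return heap_mb + max(min_overhead_mb, int(heap_mb * overhead_fraction))

print(container_mb(EXECUTOR_MB))                              # 10062
print(container_mb(DRIVER_MB))                                # 8765
print(container_mb(EXECUTOR_MB) + container_mb(DRIVER_MB))    # 18827 > NODE_MB
```

If those numbers are right, an executor container alone nearly fills a node, so the 8G driver container cannot share a core node with an executor and effectively costs one of my two worker nodes.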
Update 1
I get the same behavior with a generic JavaWordCount example:
/home/hadoop/spark/bin/spark-submit --verbose --master yarn --deploy-mode cluster --driver-memory 8G --class org.apache.spark.examples.JavaWordCount /home/hadoop/spark/lib/spark-examples-1.3.1-hadoop2.4.0.jar s3://mybucket/data/
However, if I remove --driver-memory 8G, the task is assigned 2 executors and finishes correctly.
So what is it about --driver-memory that prevents my task from getting executors?
Should the driver run on the cluster's master node, alongside the YARN master container, as explained here?
How do I give more memory to my Spark job's driver? (That is where collect and some other useful operations happen.)
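For reference, these are the driver-side settings I am aware of and would try next (values are guesses, not verified on this AMI); spark.driver.maxResultSize caps the total size of results brought back by collect:

```
spark.driver.memory        8g
spark.driver.maxResultSize 2g
```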