I'm trying to run a Python script using Spark (1.6.1) on a Hadoop cluster (2.4.2). The cluster was installed, configured and managed using Ambari (2.2.1.1).
The cluster has 4 nodes (each with a 40 GB HD, 8 cores, and 16 GB RAM).
My script uses scikit-learn, so to parallelize it on Spark I use the spark_sklearn lib (see https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html).
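For reference, the relevant part of spark_example.py looks roughly like this (a minimal sketch following the spark_sklearn blog example; the dataset and estimator here are placeholders, not my actual ones):

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from pyspark import SparkContext
from spark_sklearn import GridSearchCV  # drop-in replacement for sklearn's GridSearchCV

sc = SparkContext(appName="spark_example")

# toy dataset just to illustrate the structure
digits = datasets.load_digits()
X, y = digits.data, digits.target

param_grid = {"max_depth": [3, None], "n_estimators": [10, 20, 40]}

# spark_sklearn's GridSearchCV takes the SparkContext and distributes
# the individual model fits over the Spark executors
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
gs.fit(X, y)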
I then tried to run the script with:
spark-submit spark_example.py --master yarn --deploy-mode client --num-executors 8 --num-executor-core 4 --executor-memory 2G
but it always runs on localhost with only one executor.
From the Ambari dashboard I can also see that only one node of the cluster is consuming resources, and trying different configurations (executors, cores) gives the same execution time.
UPDATE
This is the YARN UI Nodes screenshot:
And this is the Scheduler tab:
Any ideas?
Thanks a lot




