
I'm trying to run a Python script using Spark (1.6.1) on a Hadoop cluster (2.4.2). The cluster was installed, configured and managed using Ambari (2.2.1.1).

I have a cluster of 4 nodes (each with a 40 GB HDD, 8 cores, and 16 GB of RAM).

My script uses the sklearn library, so to parallelize it on Spark I use the spark_sklearn library (see https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html).
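
For context, spark_sklearn provides a drop-in replacement for sklearn's GridSearchCV that distributes the fitting of the parameter grid across the cluster through the SparkContext. A minimal sketch of the usage pattern (the estimator, parameter grid and dataset below are placeholders, not my actual script):

from pyspark import SparkContext
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # drop-in replacement for sklearn's GridSearchCV

sc = SparkContext()                      # configured by spark-submit
digits = datasets.load_digits()          # placeholder dataset
param_grid = {"max_depth": [3, None], "n_estimators": [10, 50]}
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid=param_grid)
gs.fit(digits.data, digits.target)       # each grid point runs as a Spark task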

At this point I tried to run the script with:

spark-submit spark_example.py --master yarn --deploy-mode client --num-executors 8 --num-executor-core 4 --executor-memory 2G

but it always runs on localhost with only one executor.


Also, from the Ambari dashboard I can see that only one node of the cluster is consuming resources. And even when trying different configurations (executors, cores), the execution time stays the same.

UPDATE

This is the YARN UI Nodes tab screenshot:

[screenshot]

And this is the Scheduler tab:

[screenshots]

Any ideas?

Thanks a lot

Can you also post screenshots of the Nodes and Scheduler tabs of the YARN UI? - banjara
@shekhar I added the image. Is that what you want? - Pietro Fragnito
Can you submit using "yarn-cluster" instead of "yarn" and see if it makes any difference? - GameOfThrows
@GameOfThrows I just did it, but I see no difference - Pietro Fragnito
@shekhar see my answer :) - Pietro Fragnito

1 Answer


I'll answer my own question, thanks to a reply to the same question on the Hortonworks Community.

Setting the parameter MASTER="yarn-cluster" (or MASTER="yarn-client") seems to work: now I can see the application reports in the Spark History and YARN History UIs.


P.S.: it seems the parameters passed via the command line (e.g. --num-executors 8 --num-executor-core 4 --executor-memory 2G) are not taken into account. Instead, if I set the executor parameters in the "spark-env template" field of Ambari, they are picked up. Anyway, now it works :)
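
For anyone hitting the same issue, the equivalent settings in the Ambari "spark-env template" field would look roughly like this (variable names come from the standard spark-env.sh template read in YARN client mode; the values are just examples):

MASTER="yarn-client"            # or "yarn-cluster"
SPARK_EXECUTOR_INSTANCES=8      # number of executors
SPARK_EXECUTOR_CORES=4          # cores per executor
SPARK_EXECUTOR_MEMORY=2G        # memory per executor

Also, a likely reason the command-line parameters were ignored is that they were placed after spark_example.py: spark-submit treats everything after the application file as arguments to the script itself, and the cores flag is spelled --executor-cores. So something like the following should also work (I haven't re-tested it myself):

spark-submit --master yarn --deploy-mode client --num-executors 8 --executor-cores 4 --executor-memory 2G spark_example.py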

I hope this helps someone in the future.