
I'm trying to run a Python script using Spark (1.6.1) on a Hadoop cluster (2.4.2). The cluster was installed, configured and managed using Ambari (2.2.1.1).

I have a cluster of 4 nodes (each with a 40 GB HDD, 8 cores, and 16 GB of RAM).

My script uses the sklearn library, so to parallelize it on Spark I use the spark_sklearn library (see https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html).
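
For context, spark_sklearn provides a drop-in replacement for sklearn's GridSearchCV that distributes the fitting of the parameter grid across the cluster through the SparkContext. A minimal sketch of the usage pattern (the estimator, parameter grid and dataset below are placeholders, not my actual script):

from pyspark import SparkContext
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # drop-in replacement for sklearn's GridSearchCV

sc = SparkContext()                      # configured by spark-submit
digits = datasets.load_digits()          # placeholder dataset
param_grid = {"max_depth": [3, None], "n_estimators": [10, 50]}
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid=param_grid)
gs.fit(digits.data, digits.target)       # each grid point runs as a Spark task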

At this point I tried to run the script with:

spark-submit spark_example.py --master yarn --deploy-mode client --num-executors 8 --num-executor-core 4 --executor-memory 2G

but it always runs on localhost with only one executor.


Also, from the Ambari dashboard I can see that only one node of the cluster is consuming resources. And even when trying different configurations (executors, cores), the execution time stays the same.

UPDATE

This is the YARN UI Nodes tab screenshot:

[screenshot]

And this is the Scheduler tab:

[screenshots]

Any ideas?

Thanks a lot

Can you also post screenshots of the Nodes and Scheduler tabs of the YARN UI? - banjara
@shekhar I added the image. Is that what you want? - Pietro Fragnito
Can you submit using "yarn-cluster" instead of "yarn" and see if it makes any difference? - GameOfThrows
@GameOfThrows I just did it, but I see no difference - Pietro Fragnito
@shekhar see my answer :) - Pietro Fragnito

1 Answer


I'll answer my own question, thanks to a reply to the same question on the Hortonworks Community.

Setting the parameter MASTER="yarn-cluster" (or MASTER="yarn-client") seems to work: now I can see the application reports in the Spark History and YARN History UIs.


P.S.: it seems the parameters passed via the command line (e.g. --num-executors 8 --num-executor-core 4 --executor-memory 2G) are not taken into account. Instead, if I set the executor parameters in the "spark-env template" field of Ambari, they are picked up. Anyway, now it works :)
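
For anyone hitting the same issue, the equivalent settings in the Ambari "spark-env template" field would look roughly like this (variable names come from the standard spark-env.sh template read in YARN client mode; the values are just examples):

MASTER="yarn-client"            # or "yarn-cluster"
SPARK_EXECUTOR_INSTANCES=8      # number of executors
SPARK_EXECUTOR_CORES=4          # cores per executor
SPARK_EXECUTOR_MEMORY=2G        # memory per executor

Also, a likely reason the command-line parameters were ignored is that they were placed after spark_example.py: spark-submit treats everything after the application file as arguments to the script itself, and the cores flag is spelled --executor-cores. So something like the following should also work (I haven't re-tested it myself):

spark-submit --master yarn --deploy-mode client --num-executors 8 --executor-cores 4 --executor-memory 2G spark_example.py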

I hope this helps someone in the future.