3 votes

Summary: I can't get my Python Spark job to run on all nodes of my Hadoop cluster. I've installed Spark for Hadoop ('spark-1.5.2-bin-hadoop2.6'). When I launch a Java Spark job, the load gets distributed over all nodes; when I launch a Python Spark job, only one node takes the load.

Setup:

  • HDFS and YARN configured for 4 nodes: nk01 (namenode), nk02, nk03, nk04, running on Xen virtual servers
  • versions: jdk1.8.0_66, hadoop-2.7.1, spark-1.5.2-bin-hadoop2.6
  • Hadoop installed on all 4 nodes
  • Spark installed only on nk01

I copied a bunch of Gutenberg files (thank you, Johannes!) onto HDFS and tried doing a wordcount using Java and Python on a subset of the files (the files that start with an 'e'):

Python:

Using a homebrew Python script to do the wordcount:

/opt/spark/bin/spark-submit wordcount.py --master yarn-cluster \
    --num-executors 4 --executor-cores 1

The Python code assigns 4 partitions:

tt=sc.textFile('/user/me/gutenberg/text/e*.txt',4)
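
For context, a minimal wordcount.py along these lines might look roughly like the sketch below (the actual homebrew script isn't shown in the question; the app name and output path are made up for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")

# Read the Gutenberg subset into 4 partitions, as in the question
tt = sc.textFile('/user/me/gutenberg/text/e*.txt', 4)

# Classic wordcount: split lines into words, pair each with a 1, then sum per word
counts = (tt.flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# Write the result back to HDFS (output path is hypothetical)
counts.saveAsTextFile('/user/me/gutenberg/wordcount-out')

sc.stop()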

Load on the 4 nodes over 60 seconds:

[load graph]

Java:

Using the JavaWordCount example found in the Spark distribution:

/opt/spark/bin/spark-submit --class JavaWordCount --master yarn-cluster \
    --num-executors 4 jwc.jar '/user/me/gutenberg/text/e*.txt'

[load graph]

Conclusion: the Java version distributes its load across the cluster; the Python version runs on just 1 node.

Question: how do I get the Python version to distribute the load across all nodes as well?


2 Answers

5 votes

The Python program name was indeed in the wrong position, as suggested by Shawn Guo: spark-submit treats everything after the application file as arguments to the script itself, so the --master and executor options were effectively being ignored. It should have been run this way:

/opt/spark/bin/spark-submit --master yarn-cluster --num-executors 4 \
       --executor-cores 1 wordcount.py

That gives this load on the nodes: [load graph]

3 votes

Spark-submit

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Note the difference from a Scala/Java submit in where the parameters go:

For Python applications, simply pass a .py file in the place of application-jar instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

You should use the command below instead:

/opt/spark/bin/spark-submit --master yarn-cluster --num-executors 4 --executor-cores 1 wordcount.py