I'm using Spark with HDFS (Hadoop storage) and YARN. My cluster contains 5 nodes (1 master and 4 slaves).
- Master node: 48 GB RAM, 16 CPU cores
- Slave nodes: 12 GB RAM, 16 CPU cores
I'm running two different jobs: a WordCount and a SparkSQL query, each on a different file. Everything works, but I have some questions; maybe I don't understand Hadoop/Spark very well.
First example: WordCount
I ran the WordCount job and got the result in two files (part-00000 and part-00001). The availability (the replica locations) is slave4 and slave1 for part-00000, and slave3 and slave4 for part-00001.
Why is there no replica on slave2? Is that normal?
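For information, the replica placement can also be checked from the command line; the output path below is a placeholder for my real output directory:

hdfs fsck /user/valentin/wordcount_output -files -blocks -locations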
When I look at the application_ID, I see that only 1 slave did the job.
Why is my task not well distributed across my cluster?
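For context, the WordCount job is essentially the classic PySpark pattern, something like this (a simplified sketch, not the exact contents of SparkCount.py; the HDFS paths and app name are placeholders):

from pyspark import SparkContext

# Simplified sketch; paths and app name are placeholders, not my exact script.
sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///user/valentin/input.txt")   # read the input file from HDFS
            .flatMap(lambda line: line.split())            # split each line into words
            .map(lambda word: (word, 1))                   # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))              # sum the counts per word

# Each partition of the final RDD becomes one part-XXXXX file in the output directory.
counts.saveAsTextFile("hdfs:///user/valentin/wordcount_output")

sc.stop()

If I understand correctly, the two part-XXXXX files simply correspond to the two partitions of the RDD that gets saved.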
Second example: SparkSQL
In this case there is no output file to save because I just want to return an SQL result, but again only 1 slave node does the work.
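The SparkSQL job is roughly of this form (again a simplified sketch; the input file, view name, and query are placeholders):

from pyspark.sql import SparkSession

# Simplified sketch; the input file, view name, and query are placeholders.
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# Load the second file into a DataFrame and expose it as a temporary view.
df = spark.read.csv("hdfs:///user/valentin/data.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("data")

# Run the query and only display the result; nothing is written back to HDFS.
spark.sql("SELECT COUNT(*) AS n FROM data").show()

spark.stop()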
So why does only 1 slave node execute the task while my cluster seems to be working fine?
The command line I use to run the job is:
time ./spark/bin/spark-submit --master yarn --deploy-mode cluster /home/valentin/SparkCount.py
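I don't pass any executor options, so I suppose only the default resources are requested. If the resources have to be requested explicitly, I assume it would look something like this (the flag values below are just examples, not values tuned for my nodes):

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 4G \
  /home/valentin/SparkCount.py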
Thank you!
Comment: […] spark-submit to use more than the default values. – OneCricketeer