I am starting with Spark and have never worked with Hadoop. I have 10 iMacs on which I have installed Spark 1.6.1 with Hadoop 2.6. I downloaded the precompiled version and just copied the extracted contents into /usr/local/spark/. I did all the environment variable setup: SCALA_HOME, changes to PATH, and the other Spark conf. I am able to run both spark-shell and pyspark (with Anaconda's Python).
I have set up the standalone cluster; all the nodes show up on my web UI. Then, using the Python shell (run against the cluster, not locally), I followed this link's Python interpreter word count example.
This is the code I have used:

from operator import add

def tokenize(text):
    # Split each line into whitespace-separated tokens
    return text.split()

text = sc.textFile("Testing/shakespeare.txt")   # load the input as an RDD of lines
words = text.flatMap(tokenize)                  # flatten lines into individual words
wc = words.map(lambda x: (x, 1))                # pair each word with a count of 1
counts = wc.reduceByKey(add)                    # sum the counts per word
counts.saveAsTextFile("wc")                     # write the results out as text
It gives me an error that the file shakespeare.txt was not found on the slave nodes. Searching around, I understood that if I am not using HDFS then the file must be present on each slave node at the same path. Here is the stack trace - github gist
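Until I figure that out, the only workaround I can think of for a small file like this is to read it on the driver and distribute the lines myself. Just a sketch, assuming the file fits in the driver's memory:

# Workaround sketch: read the file on the driver (where it does exist)
# and distribute its lines with parallelize, so no slave node needs a
# local copy. Only viable while the file fits in driver memory.
with open("Testing/shakespeare.txt") as f:
    lines = f.readlines()
text = sc.parallelize(lines)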
Now, I have a few questions:
Isn't an RDD supposed to be distributed? That is, when the action was run on the RDD, shouldn't it have distributed the file across all the nodes instead of requiring me to distribute it myself?
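From what I can tell (a sketch of my understanding, not something I have verified), textFile only records the path and the partitioning, and the actual read happens lazily on the executors, which would explain why each node needs its own local copy:

# My understanding (unverified): textFile is lazy; it only records the
# path and the partitioning, and each executor opens its own partition
# of the file locally when an action finally runs.
text = sc.textFile("Testing/shakespeare.txt")
print(text.getNumPartitions())  # partition metadata exists before any data is read
text.count()                    # only this action makes the executors open the file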
I downloaded Spark with Hadoop 2.6, but none of the Hadoop commands are available for setting up HDFS. I extracted the Hadoop jar found in spark/lib hoping to find some executables, but there was nothing. So, what Hadoop-related files are actually provided in the Spark download?

Lastly, how can I run a distributed application (spark-submit) or a distributed analysis (using pyspark) on the cluster? If I have to create an HDFS, what extra steps are required, and how can I create an HDFS here?
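For context, this is what I imagine the spark-submit version of my word count would look like. It is only a sketch: the master URL, the HDFS host and port, and the paths are all assumptions on my part.

# wordcount.py - hypothetical standalone version of the shell session above.
# I would expect to launch it with something like (master URL assumed):
#   /usr/local/spark/bin/spark-submit --master spark://master-host:7077 wordcount.py
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Assumed HDFS URI; without HDFS this would have to be a path that
# exists on every node, which is exactly my problem.
text = sc.textFile("hdfs://master-host:9000/user/me/shakespeare.txt")
counts = text.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://master-host:9000/user/me/wc")
sc.stop()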