0 votes

I am starting with Spark and have never worked with Hadoop. I have 10 iMacs on which I have installed Spark 1.6.1 with Hadoop 2.6. I downloaded the precompiled version and just copied the extracted contents into /usr/local/spark/. I set up all the environment variables (SCALA_HOME, changes to PATH, and the other Spark configuration). I am able to run both spark-shell and pyspark (with Anaconda's Python).

I have set up the standalone cluster; all the nodes show up in my web UI. Now, using the Python shell (run on the cluster, not locally), I followed this link's Python-interpreter word count example.

This is the code I have used:

from operator import add

def tokenize(text):
    # split each line into whitespace-separated words
    return text.split()

text = sc.textFile("Testing/shakespeare.txt")   # lazy: nothing is read yet
words = text.flatMap(tokenize)                  # one element per word
wc = words.map(lambda x: (x, 1))                # pair each word with a count of 1
counts = wc.reduceByKey(add)                    # sum the counts per word

counts.saveAsTextFile("wc")                     # action: triggers the actual job

It is giving me an error saying that the file shakespeare.txt was not found on the slave nodes. Searching around, I understood that if I am not using HDFS then the file should be present on each slave node at the same path. Here is the stack trace - github gist
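If I understand that correctly, it means something like the following, after manually copying the file to an identical absolute path on every node first (the absolute path below is only an illustration, not my actual layout):

text = sc.textFile("file:///usr/local/spark/Testing/shakespeare.txt")  # same absolute path must exist on every node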

Now, I have a few questions:

  • Isn't an RDD supposed to be distributed? That is, shouldn't it have distributed the file across all the nodes (when the action was run on the RDD) instead of requiring me to distribute it myself?

  • I downloaded Spark built for Hadoop 2.6, but none of the Hadoop commands are available to set up HDFS. I extracted the Hadoop JAR found in spark/lib hoping to find some executables, but there was nothing. So, what Hadoop-related files are provided in the Spark download?

  • Lastly, how can I run a distributed application (with spark-submit) or a distributed analysis (using pyspark) on the cluster? If I have to create an HDFS, what extra steps are required? Also, how can I create an HDFS here?

Can you show us your code? – Yuval Itzchakov
I have added the code and stack trace. – TrigonaMinima
The RDD isn't going to distribute your file for you; that's why you would normally use a distributed file system like HDFS. – femibyte

1 Answer

0 votes

If you read the Spark Programming Guide, you will find the answer to your first question:

To illustrate RDD basics, consider the simple program below:

val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)

The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.

Remember that transformations are executed on the Spark workers (see the link, slide 21).
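In PySpark the same behaviour looks roughly like this (a sketch mirroring the Scala snippet above, not your exact job):

lines = sc.textFile("data.txt")                          # lazy: just a pointer to the file
line_lengths = lines.map(lambda s: len(s))               # lazy: transformation, nothing computed yet
total_length = line_lengths.reduce(lambda a, b: a + b)   # action: tasks are shipped to the workers

Because each task reads its own partition of the file on the worker it runs on, a path that exists only on the driver machine produces exactly the "file not found" error you saw on the slaves.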

Regarding your second question: Spark ships only the libraries needed to talk to a Hadoop infrastructure, as you can see. You need to set up the Hadoop cluster first (HDFS, etc.) in order to use it with the libs bundled in Spark: have a look at Hadoop Cluster Setup.
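Once HDFS is running, you upload the file once (for example with hdfs dfs -put) and every worker reads it from the same URI. A sketch; "namenode-host:9000" and the /user/you/ paths are placeholders for whatever your HDFS setup uses:

from operator import add

text = sc.textFile("hdfs://namenode-host:9000/user/you/shakespeare.txt")  # placeholder HDFS URI
counts = (text.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(add))
counts.saveAsTextFile("hdfs://namenode-host:9000/user/you/wc")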

To answer your last question, I hope the official documentation helps, in particular the Spark Standalone page.
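As for spark-submit: a standalone PySpark script creates its own SparkContext and points at your standalone master. A minimal sketch; the master URL, script name, and HDFS paths below are assumptions about your setup, not values taken from it:

# wordcount.py -- launched with something like:
#   spark-submit --master spark://<master-host>:7077 wordcount.py
from operator import add
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("WordCount")
sc = SparkContext(conf=conf)

text = sc.textFile("hdfs://<namenode-host>:9000/user/you/shakespeare.txt")  # or a path present on every node
counts = (text.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(add))
counts.saveAsTextFile("hdfs://<namenode-host>:9000/user/you/wc")

sc.stop()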