0 votes

As a Hadoop/Spark beginner, I followed the tutorial on this website and successfully deployed a Hadoop framework on my single machine (CentOS 6). Now I want to install Spark 1.2 on the same machine and have it work with the single-node YARN cluster there, i.e. run Spark SQL on files stored in HDFS on my single machine and write the results back to HDFS. I couldn't find a good online tutorial covering the remaining steps for this scenario.

What I have done so far (the rough commands are sketched below):
(1) Downloaded Scala 2.9.3 from the official Scala website and installed it. The "scala -version" command works!
(2) Downloaded Spark 1.2.1 (pre-built for Hadoop 2.4 and later) from the Apache Spark website and untarred it.
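Roughly, what I ran was along these lines (the install locations are just where I happened to put things):

# unpack Scala and the pre-built Spark package
tar -xzf scala-2.9.3.tgz -C /usr/local/
tar -xzf spark-1.2.1-bin-hadoop2.4.tgz -C /usr/local/

# added to ~/.bashrc so both are on the PATH
export SCALA_HOME=/usr/local/scala-2.9.3
export SPARK_HOME=/usr/local/spark-1.2.1-bin-hadoop2.4
export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin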

What should I do next? Which config files in the Spark directory do I need to change, and how? Can someone give a step-by-step tutorial, especially for how to configure spark-env.sh? The more detailed the better. Thanks! (If you have questions about how I configured Hadoop and YARN, I followed exactly the steps listed on the website I mentioned above.)


1 Answer

1 vote

If you want to use YARN then you have to compile Spark using Maven. There are various build parameters depending on what support you want (Hadoop version, Hive compatibility, etc.). Here is the link with the parameter details: http://spark.apache.org/docs/1.2.1/building-spark.html

Here is the command that I used to build Spark with Hive support against Apache Hadoop 2.6.0:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package

For running a single-node cluster, you don't need to change spark-env.sh. Simply setting HADOOP_CONF_DIR or YARN_CONF_DIR in your environment is sufficient, and for non-YARN mode you don't even need that. spark-env.sh just lets you set the various environment variables in a single place, so you can keep your Hadoop config, memory tuning settings, etc. together. The template is quite well documented.
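For illustration, a minimal conf/spark-env.sh for a single-node setup might look something like the sketch below. The Hadoop path is an assumption (point it at wherever your tutorial put the Hadoop configuration), and the memory settings are optional examples:

# conf/spark-env.sh  (copy conf/spark-env.sh.template and edit)
# Tell Spark where the Hadoop/YARN client configs live, so it can find
# the NameNode and ResourceManager of your single-node cluster.
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop   # assumed install path
export YARN_CONF_DIR=$HADOOP_CONF_DIR

# Optional sizing for a small single machine
export SPARK_EXECUTOR_MEMORY=1g
export SPARK_DRIVER_MEMORY=1g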

Just fire up the cluster components using the scripts in the sbin directory (usually start-all.sh is enough). One more point: if you want Spark SQL to use the Hive metastore, you have to put hive-site.xml in the conf directory, with hive.metastore.uris set to point to your metastore server.
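As a rough end-to-end sketch (the Hadoop path, the Spark directory, and the example jar location are assumptions for illustration): start the Hadoop daemons, then point a Spark job at YARN:

# Start HDFS and YARN for the single-node cluster (Hadoop's own sbin scripts)
/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh

# Quick smoke test that Spark can talk to YARN
cd /path/to/spark   # wherever you untarred or built Spark
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  lib/spark-examples-*.jar 10   # jar is under examples/target/ for a source build

# Interactive shell on YARN for Spark SQL work
./bin/spark-shell --master yarn-client

Inside the shell, sc and a SQLContext built from it can read from and write to hdfs:// paths directly, which covers the "query a file on HDFS and write the result back" part of the question.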