0 votes

I am new to using Spark with Hadoop.

Current Scenario:

I have already configured Spark on a 4-node cluster using the pre-built binary "spark-1.5.2-bin-hadoop2.6".

There is also a 4-node Hadoop 2.4 cluster in my environment.

What I want:

I am planning to do Spark RDD processing using Hive HQL on the data stored in HDFS on the Hadoop cluster, roughly along the lines of the sketch below.
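This is only a rough sketch of what I have in mind (Spark 1.5 API): the table name "my_table" and the query are placeholders, and it assumes Spark can see the Hive metastore (e.g. via a hive-site.xml on its classpath).

    // Rough sketch, Spark 1.5 API; "my_table" and the query are placeholders
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HqlOnHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hql-on-hdfs"))

        // HiveContext runs HQL over tables registered in the Hive metastore,
        // whose underlying data lives in HDFS on the Hadoop cluster
        val hiveContext = new HiveContext(sc)

        // The query returns a DataFrame; .rdd exposes it as an RDD of Rows
        val result = hiveContext.sql("SELECT col1, COUNT(*) FROM my_table GROUP BY col1")
        result.rdd.take(10).foreach(println)

        sc.stop()
      }
    }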

Queries:

  1. Do I need to reconfigure the Spark cluster using the "spark-1.5.2-bin-hadoop2.4" binary, or will the current one work?

  2. Is it good practice to run Spark on top of Hadoop with Spark and Hadoop on two different clusters (but within the same subnet in the cloud)?


2 Answers

0 votes

I'd say the best practice would be to run Spark and Hadoop on the same cluster. In fact, Spark can run as a YARN application (if you use spark-submit with --master yarn-client). Why? It boils down to data locality. Data locality is a fundamental concept in Hadoop and in data systems in general: the data you want to process is so big that, rather than moving the data, you move the program to the nodes where the data resides. So if you run Spark on a different cluster, all the data has to be moved from one cluster to the other over the network. It is more efficient to have computation and data on the same nodes.
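For illustration, here is a minimal sketch of such a job (the HDFS path, class name, and jar name are hypothetical placeholders). It would be compiled against the cluster's Hadoop version and submitted with spark-submit --master yarn-client, so the executors run on the same nodes as the HDFS datanodes:

    // Minimal sketch; the HDFS path is a placeholder
    // Submit with: spark-submit --class WordCount --master yarn-client wordcount.jar
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

        // When the executors run next to the HDFS datanodes, these reads are
        // served locally instead of being shipped across the network
        val counts = sc.textFile("hdfs:///data/input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }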

As for versions, having two Hadoop clusters with different versions can be a pain. I'd recommend having two separate installations of Spark, one per cluster, each built against the appropriate version of Hadoop.

0 votes

You should use a version of Spark that is compatible with your version of Hadoop.

As I recently got to know, you can refer to the compatibility chart here: http://hortonworks.com/wp-content/uploads/2016/03/asparagus-chart-hdp24.png