
I've manually installed a three-node cluster with the following configuration:

Master/Slave Node 0 - NameNode, Secondary NameNode, JobTracker, HMaster, 
    DataNode, TaskTracker, HRegionServer, 
    Hive MetaStore, Database for Hive/Sqoop, HiveServer2, HCatalog, 
    Oozie Server, 
    Zookeeper, 
    Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop

Slave Node 1 - DataNode, TaskTracker, HRegionServer,  
    Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop

Slave Node 2 - DataNode, TaskTracker, HRegionServer,  
    Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop

I wish to have a more realistic cluster. I was thinking about using 12-14 nodes for the following:

Master 0: Name Node
Master 1: Secondary NameNode
Master 2: JobTracker
Master 3: HMaster

Slave 0: DataNode, TaskTracker, HRegionServer
Slave 1: DataNode, TaskTracker, HRegionServer
Slave 2: DataNode, TaskTracker, HRegionServer

Hive/Catalog Node: Hive MetaStore, 
    Sqoop MetaStore
    MySQL/PostgreSQL Database for Hive/Sqoop, 
    HCatalog, 
    HiveServer (Or is it better to break HiveServer into its own node?)
    Oozie-Server (Or is it better to break Oozie-server into its own node?)

Zookeeper Ensemble: 3 Nodes with Zookeeper installed

Client Node: Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop
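
To make the wiring concrete: as long as each client's configuration names the right master hosts, the client tools can live on any node. Below is a minimal Java sketch using the MRv1-era property names that match a JobTracker/TaskTracker setup; the hostnames (master0, master2, hivenode, zk1-zk3) are placeholders for the nodes above, and the ports are the usual defaults:

    // Minimal client-side configuration sketch. All hostnames are
    // placeholders for the proposed layout above.
    import org.apache.hadoop.conf.Configuration;

    public class ClientConf {
        public static Configuration create() {
            Configuration conf = new Configuration();
            // HDFS clients talk to the NameNode on Master 0
            conf.set("fs.default.name", "hdfs://master0:8020");
            // MapReduce clients submit to the JobTracker on Master 2
            conf.set("mapred.job.tracker", "master2:8021");
            // HBase clients find the cluster through the Zookeeper ensemble
            conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
            // Hive clients talk to the remote metastore service
            conf.set("hive.metastore.uris", "thrift://hivenode:9083");
            return conf;
        }
    }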

I know Cloudera likes you to have:

A separate Master Node for each Master Process (NameNode, Secondary NameNode, JobTracker, HMaster)
3 Slave nodes with DataNode, TaskTracker, and HRegionServer
3 Zookeeper Nodes
"The database, the HiveServer process, and the metastore service can all 
be on the same host, but running the HiveServer process on a separate host 
provides better availability and scalability."

I've used the same MySQL instance for my Hive database and my Oozie database, and figured that would be ok to do again. I'm also figuring the HiveServer and the Oozie-server can run on the same host as the Hive/Oozie MetaStore, along with HCatalog.
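
As a purely hypothetical sanity check for that shared-instance setup, the sketch below just opens a JDBC connection to each service's database on the one MySQL host; the host hivenode, the database names metastore and oozie, and the credentials are all placeholders. hive-site.xml (javax.jdo.option.ConnectionURL) and oozie-site.xml (oozie.service.JPAService.jdbc.url) would then point at those same two URLs.

    // Hypothetical smoke test: confirm both service databases exist and
    // are reachable on the shared MySQL instance. All names and
    // credentials below are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class SharedDbCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver"); // MySQL Connector/J
            String[] urls = {
                "jdbc:mysql://hivenode:3306/metastore", // Hive metastore DB
                "jdbc:mysql://hivenode:3306/oozie"      // Oozie DB
            };
            for (String url : urls) {
                try (Connection c = DriverManager.getConnection(url, "hadoop", "secret")) {
                    System.out.println("OK: " + url);
                }
            }
        }
    }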

Right now on my three-node cluster, I have installed all the client software on every node so I can execute M/R, Hive, Oozie, HBase, Pig, etc. client calls from any node. Are these client tools supposed to be executed on a node separate from the Master and Slave nodes? Speaking of which, I have been putting all my Java/Python/Pig code on the Master Node in my three-node cluster. Is this code also better kept on a separate client node?

Am I on the right path here? What is the proper way to make the smallest but ideal cluster?

1 Answer


Your setup looks pretty standard for the most part. Unfortunately, there isn't an "ideal" cluster; it all depends on your workload. If you need a lot of computation, it's probably better to go heavier on the MapReduce components. If you only plan on using HBase for low-latency access, then you may want to forgo MapReduce entirely.

There are a few general suggestions I would make about your setup.

  1. You can co-locate the RegionServers with the Zookeeper nodes; just give the Zookeeper nodes their own dedicated disk, since Zookeeper's transaction log is sensitive to disk latency.

  2. Be careful co-locating the TaskTrackers and RegionServers, especially if most of your HBase usage is scan-heavy. Both processes are quite CPU- and memory-intensive and can lead to resource contention issues. This page has more details on what to do in this situation.

As far as code organization and client setup go, that's really your call. I personally prefer setting up a few gateway nodes which have all of the configuration for talking to Hive, HBase, etc., and running jobs from there, but again there is no perfect answer for that.
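
For illustration, here's roughly what job submission from such a gateway node looks like with the old MRv1 API; the hostnames and paths are placeholders, and since no mapper or reducer is set, the job runs as an identity pass-through:

    // Sketch of submitting a job from a gateway node (MRv1 API, to match
    // the JobTracker/TaskTracker cluster discussed above).
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitFromGateway {
        public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(SubmitFromGateway.class);
            job.setJobName("gateway-example");
            // Only these settings tie the client to the cluster; any host
            // with the client jars and this configuration can submit.
            job.set("fs.default.name", "hdfs://master0:8020");
            job.set("mapred.job.tracker", "master2:8021");
            FileInputFormat.setInputPaths(job, new Path("/user/me/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));
            JobClient.runJob(job); // blocks until the job completes
        }
    }

The point is just that nothing about submission requires a master or slave host, so keeping your Java/Python/Pig code on one or two gateway machines keeps the masters clean.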