I've manually installed a three node cluster with the following configuration:
Master/Slave Node 0 - NameNode, Secondary NameNode, JobTracker, HMaster,
    DataNode, TaskTracker, HRegionServer,
    Hive MetaStore, Database for Hive/Sqoop, HiveServer2, HCatalog,
    Oozie Server,
    ZooKeeper,
    Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop
Slave Node 1 - DataNode, TaskTracker, HRegionServer,
    Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop
Slave Node 2 - DataNode, TaskTracker, HRegionServer,
    Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop
I wish to have a more realistic cluster. I was thinking about using 12-14 nodes for the following:
Master 0: NameNode
Master 1: Secondary NameNode
Master 2: JobTracker
Master 3: HMaster
Slave 0: DataNode, TaskTracker, HRegionServer
Slave 1: DataNode, TaskTracker, HRegionServer
Slave 2: DataNode, TaskTracker, HRegionServer
Hive/Catalog Node: Hive MetaStore,
    Sqoop MetaStore,
    MySQL/PostgreSQL Database for Hive/Sqoop,
    HCatalog,
    HiveServer (Or is it better to break HiveServer into its own node?)
    Oozie-Server (Or is it better to break Oozie-server into its own node?)
ZooKeeper Ensemble: 3 nodes with ZooKeeper installed
Client Node: Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop
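To sanity-check the 12-14 node figure, here is a rough Python sketch of the layout above. The role names come from my list; the dictionary structure and the `count_nodes` helper are just my own illustration, not anything from Cloudera or a Hadoop API.

```python
# Proposed layout: node group -> (number of hosts, roles on each host).
# The grouping and the names below are illustrative only.
PROPOSED_LAYOUT = {
    "Master 0": (1, ["NameNode"]),
    "Master 1": (1, ["Secondary NameNode"]),
    "Master 2": (1, ["JobTracker"]),
    "Master 3": (1, ["HMaster"]),
    "Slaves": (3, ["DataNode", "TaskTracker", "HRegionServer"]),
    "Hive/Catalog Node": (1, [
        "Hive MetaStore", "Sqoop MetaStore", "MySQL/PostgreSQL Database",
        "HCatalog", "HiveServer", "Oozie-Server",
    ]),
    "ZooKeeper Ensemble": (3, ["ZooKeeper"]),
    "Client Node": (1, [
        "Oozie-client", "Hive-client", "pig-client",
        "M/R client tools", "Sqoop",
    ]),
}


def count_nodes(layout, split_out=()):
    """Return the total host count.

    Each role named in split_out is assumed to move to its own
    dedicated host, which adds one host per role.
    """
    known_roles = {role for _, roles in layout.values() for role in roles}
    for role in split_out:
        if role not in known_roles:
            raise ValueError(f"unknown role: {role}")
    base = sum(hosts for hosts, _ in layout.values())
    return base + len(split_out)


if __name__ == "__main__":
    # HiveServer and Oozie-Server sharing the Hive/Catalog node.
    print(count_nodes(PROPOSED_LAYOUT))
    # HiveServer and Oozie-Server each on their own node.
    print(count_nodes(PROPOSED_LAYOUT, ["HiveServer", "Oozie-Server"]))
```

So the low end of the range is with HiveServer and Oozie-Server sharing the Hive/Catalog node, and the high end is with each of them broken out onto its own host.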
Or, in diagram format: [diagram not shown]
I know Cloudera likes you to have:
    A separate Master Node for each Master Process (NameNode, Secondary NameNode, JobTracker, HMaster)
    3 Slave nodes with DataNode, TaskTracker, and HRegionServer
    3 ZooKeeper Nodes
"The database, the HiveServer process, and the metastore service can all
be on the same host, but running the HiveServer process on a separate host
provides better availability and scalability."
I've used the same MySQL instance for my Hive database and my Oozie database, and figured that would be OK to do again. I'm also assuming that HiveServer and the Oozie server can run on the same host as the Hive/Oozie MetaStore, along with HCatalog.
Right now on my three-node cluster, I have installed all the client software on every node so I can execute M/R, Hive, Oozie, HBase, Pig, etc. client calls from any node. Are these client tools supposed to be run from a node separate from the Master and Slave nodes? Speaking of which, I have been putting all my Java/Python/Pig code on the Master Node in my three-node cluster. Would this code also be better placed on a separate client node?
Am I on the right path here? What is the proper way to build the smallest cluster that is still close to ideal?