I am trying to set up all the projects from the Apache Hadoop stack in one cluster. What is the sequence for setting up the Apache Hadoop ecosystem frameworks, e.g. Hadoop, HBase, ...? And if you have tested a specific set of steps, can you tell what kinds of problems can be faced during deployment? Main frameworks for deployment: Hadoop, HBase, Pig, Hive, HCatalog, Mahout, Giraph, ZooKeeper, Oozie, Avro, Sqoop, MRUnit, Crunch (please add if I missed something).
3 Answers
There are different possible orders, since not all of the listed products depend on each other.
In a nutshell:
1. Hadoop (HDFS, MapReduce)
2. Pig, Hive, Sqoop, Oozie
3. ZooKeeper (needed for HBase)
4. HBase
I am not 100% sure about the Mahout and MRUnit dependencies, but I think they need only Hadoop, if anything.
Avro is not directly dependent on Hadoop; it is a serialization library.
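To illustrate that last point, here is a minimal sketch of using Avro purely as a serialization library, with no Hadoop cluster involved at all (the schema, field names, and output file are made up for the example):

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroStandaloneExample {
    public static void main(String[] args) throws IOException {
        // Hypothetical record schema for this sketch -- no Hadoop services are needed.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Write a local Avro container file; HDFS is optional, not required.
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(schema, new File("users.avro"));
            fileWriter.append(user);
        }
    }
}
```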
I would say that deployment is driven by your primary requirement, and based on that requirement you choose which other components are needed. I consider the Hadoop setup as below:
1. Hadoop core (Hadoop Common + HDFS + MapReduce -> one single big component)
2. Hadoop components (depending on your choice)
For example, if you set up just 1), you can still run MapReduce jobs while copying your data into HDFS (see the sketch below). I hope you get my point.
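As a minimal sketch of that "core only" scenario, copying a local file into HDFS from Java needs nothing beyond Hadoop itself (the namenode URI and paths below are placeholders; normally the address comes from core-site.xml on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address for this sketch.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; MapReduce jobs can then read it from there.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/user/hadoop/input.txt"));
        fs.close();
    }
}
```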
Now, say for example you want to do data analysis work using Hive and Pig; for that you can set up Hive and Pig on top of it (a Hive JDBC sketch follows below).
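One way to run Hive queries once Hive is set up is over JDBC; a rough sketch (the HiveServer2 host, port, credentials, and table name are assumptions, and older Hive releases ship a different driver class and URL scheme):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, credentials, and table are placeholders for this sketch.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hadoop", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM logs LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```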
At the same time, you might decide to connect this Hadoop cluster with SQL Server/SQL Azure so you can import data from SQL Server/SQL Azure into HDFS. For this you can set up HiveODBC and Sqoop, which give you the functionality to import/export data between HDFS and SQL Server/Azure (see the Sqoop sketch below). HiveODBC and Sqoop also let you connect your on-premises Excel and PowerPivot to HDFS directly and get the Hive tables from there.
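Sqoop itself is normally driven from the command line; here is a hedged sketch of kicking off such an import from Java by shelling out to the sqoop binary on the PATH (the connection string, credentials, table, and target directory are made-up placeholders):

```java
import java.io.IOException;

public class SqoopImportLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder SQL Server connection details and table name for this sketch.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:sqlserver://sqlhost:1433;databaseName=sales",
                "--username", "sqoopuser",
                "--password", "secret",
                "--table", "orders",
                "--target-dir", "/user/hadoop/orders");
        pb.inheritIO();                      // stream Sqoop's output to this console
        int exitCode = pb.start().waitFor(); // 0 means the import succeeded
        System.out.println("sqoop exited with " + exitCode);
    }
}
```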
If you want a NoSQL database sitting on top of HDFS, you can certainly choose HBase, which sits on top of HDFS and lets you run MapReduce jobs against it (a client sketch follows below).
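A minimal sketch of talking to HBase once it is running on top of HDFS, using the HBase 1.x+ client API (the ZooKeeper quorum hosts, table, and column family names are assumptions, and the table is assumed to exist already):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // HBase clients locate the cluster through ZooKeeper -- which is why ZooKeeper
        // must be set up before HBase. The quorum hosts here are placeholders.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Write one cell, then read it back.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
        }
    }
}
```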
And so on: depending on your requirements, you create a list of what is needed and set it up in your cluster(s). There is no hard and fast rule about what is needed; as long as the base Hadoop core (see above) is there, the rest can be set up on top of it.
Two open source projects which you might find interesting and which can provide guidance and ideas are:
- Apache Whirr - http://whirr.apache.org/
- Apache Bigtop - http://incubator.apache.org/bigtop/
Look at what they do/use to deploy the projects you mentioned and then ask yourself: "do you really need to do it yourself/differently?" ;-)