4
votes

Hadoop 2.6 uses YARN as the next-generation MapReduce framework and also as the cluster resource manager. Do we still need to use ZooKeeper with Hadoop 2.6 for cluster-management services? How do we set up ZooKeeper?

How is Kafka connectivity installed for a Hadoop cluster? What would the consumer and producer be for Kafka to send data to the Hadoop file system?

Where do they all fit in?

I have set up a Hadoop 2.6 single-node cluster. The way I understand it, the next step is to have ZooKeeper and Kafka for streaming data into the Hadoop file system, but I have no idea how to use Kafka with Hadoop or its API.

2 Answers

4
votes

ZooKeeper is a coordination framework for distributed systems. ZooKeeper is used for coordinating state in HDFS and YARN high availability, coordination between the HBase master and region servers, etc. Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data. Common use cases include:

  1. Stream processing
  2. Website activity tracking
  3. Metrics collection and monitoring
  4. Log aggregation

Usually we use Kafka along with Storm. Storm needs a ZooKeeper cluster for coordination between the nimbus and the supervisors. Kafka needs ZooKeeper for storing information about the cluster status and consumer offsets.

Basically, ZooKeeper provides a highly available file-system-like store where users/applications can read and write small pieces of data. This data can be something related to coordination or transactions. Since the store is highly available, communications will always complete and will not end up in a partial or unknown state. A ZooKeeper ensemble can withstand a certain number of failures depending on the number of servers (say N): because a majority of servers must remain up to form a quorum, it can tolerate up to (N-1)/2 failures. For more details, you can refer to the following URLs 1 2 3
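The quorum arithmetic above can be sketched quickly (a plain illustration of the majority rule, not ZooKeeper code):

```python
# ZooKeeper stays available as long as a majority (quorum) of servers is up,
# so an ensemble of N servers tolerates floor((N-1)/2) failures.
def zk_fault_tolerance(n_servers: int) -> int:
    """Maximum server failures a ZooKeeper ensemble of n_servers survives."""
    return (n_servers - 1) // 2

# A 3-node ensemble tolerates 1 failure; 5 nodes tolerate 2.
# Note that 4 nodes also tolerate only 1 failure, which is why
# ensembles are usually run with an odd number of servers.
for n in (1, 3, 4, 5):
    print(n, "servers ->", zk_fault_tolerance(n), "tolerated failures")
```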

2
votes

Kafka works on the producer/consumer concept, where producers write to a topic and consumers consume data from that topic. A topic is split into partitions, and each consumer can consume data from any available partition of that topic.
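To make the topic/partition/offset model concrete, here is a tiny in-memory sketch (illustrative only; a real application would use a Kafka client library against a running broker, and the class and field names here are made up for the example):

```python
from collections import defaultdict

class Topic:
    """A topic is a set of numbered partitions, each an append-only log."""
    def __init__(self, name, num_partitions=2):
        self.name = name
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def produce(self, key, value):
        # Like Kafka, route by key hash so the same key always
        # lands in the same partition (preserving per-key order).
        p = hash(key) % self.num_partitions
        self.partitions[p].append(value)
        return p

class Consumer:
    """Reads one partition, tracking the offset of the next unread message."""
    def __init__(self, topic, partition):
        self.topic = topic
        self.partition = partition
        self.offset = 0

    def poll(self):
        log = self.topic.partitions[self.partition]
        messages = log[self.offset:]
        self.offset = len(log)  # advance past everything just read
        return messages

t = Topic("web-logs")
t.produce("user-1", "click /home")
t.produce("user-1", "click /about")
consumers = [Consumer(t, i) for i in range(t.num_partitions)]
# Both messages share a key, so one consumer gets both, in order:
print(sum((c.poll() for c in consumers), []))
```

Calling `poll()` again returns nothing until new messages arrive, which is exactly the role the stored offset plays.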

Consumers of topics also register themselves in ZooKeeper, in order to coordinate with each other and balance the consumption of data.

Consumers track the maximum offset they have consumed in each partition. If offsets.storage=zookeeper, this value is stored in a ZooKeeper directory: /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] --> offset_counter_value (persistent node). Refer to the Kafka documentation for more information on the use of ZooKeeper in Kafka.
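As a small illustration, the offset znode path from the layout above can be assembled like this (a sketch only; the helper name and the example values are made up, and the bracketed fields come from the Kafka docs):

```python
def offset_znode(group_id, topic, broker_id, partition_id):
    # Path layout used by ZooKeeper-based Kafka consumers:
    # /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id]
    return f"/consumers/{group_id}/offsets/{topic}/{broker_id}-{partition_id}"

print(offset_znode("log-aggregators", "web-logs", 0, 3))
# /consumers/log-aggregators/offsets/web-logs/0-3
```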