30
votes

The use case is this: I've several java applications running which all have to interact with different (each one has a specific target) elasticsearch indices. For instance an application A uses the indices A,B,C of ElasticSearch to query and update. Application B uses indices A,C,D(say).

Some common interface is required which can manage all these data streams. Currently I'm evaluating Kafka and fluentd for this purpose. Can someone explain which will be better suited for this situation. I've looked at features of both Kafka and Fluentd and I don't really understand the difference it would make here. Thanks a lot.

2

2 Answers

59
votes

kafka provides publish/subscribe messaging as a distributed commit log. Usually you install kafka on each host where you need to produce some data to be forwarded somewhere else and all those hosts will together form a cluster. The good thing here is that if for some reason network connectivity becomes unstable or goes down, your application can continue to produce data/logs and they won't be lost. Whereas if your application directly sends logs to some remote centralized logging host, you might lose some logs during the time the network goes down.

fluentd is a centralized log collector which is commonly installed on one host (or more if you need horizontal scaling). It connects to remote data sources, applies filtering and sends unified log data to remote data sinks.

From the fluentd docs, you can see that fluentd can consume data from kafka and produce data towards kafka as well. This alone should hint that fluentd and kafka are on different layers since the former uses the latter.

It would be more logical to compare fluentd and logstash actually. As far as fluentd is concerned, kafka is just another data source and/or data sink, but they are different beasts altogether.

If you want the best of both worlds, use kafka as input/output data pipes from/to your apps and fluentd (or logstash) as your centralized logging system reading from those kafka topics.

If you want to read more on the topic, you can read how fluentd and kafka complement each other very well, read they are not competing against each other.

4
votes

From: The Life Blood Of Your Data Pipeline

Kafka is primarily related to holding log data rather than moving log data. Thus, Kafka producers need to write the code to put data in Kafka, and Kafka consumers need to write the code to pull data out of Kafka.

Fluentd has both input and output plugins for Kafka so that data engineers can write less code to get data in and out of Kafka. We have many users that use Fluentd as a Kafka producer and/or consumer.