3
votes

I am setting up Flume but am not sure what topology to go with for our use case.

We basically have two web servers, each of which can generate logs at around 2000 entries per second, with each entry around 137 bytes in size.

Currently we use rsyslog (writing to a TCP port), to which a PHP script writes these logs. We are running a local Flume agent on each web server; these local agents listen on a TCP port and put the data directly into HDFS.

So localhost:tcpport is the Flume source and HDFS is the Flume sink.
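For reference, the agent config on each web server currently looks roughly like the sketch below (the agent name a1, port 5140 and the HDFS path are placeholders for illustration, not our exact values):

    # Flume agent on each web server: syslog TCP source -> memory channel -> HDFS sink
    a1.sources = syslog-in
    a1.channels = mem-ch
    a1.sinks = hdfs-out

    # rsyslog forwards the PHP log entries to this local TCP port
    a1.sources.syslog-in.type = syslogtcp
    a1.sources.syslog-in.host = localhost
    a1.sources.syslog-in.port = 5140
    a1.sources.syslog-in.channels = mem-ch

    a1.channels.mem-ch.type = memory
    a1.channels.mem-ch.capacity = 10000

    a1.sinks.hdfs-out.type = hdfs
    a1.sinks.hdfs-out.channel = mem-ch
    a1.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/flume/weblogs
    a1.sinks.hdfs-out.hdfs.fileType = DataStream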

I am not sure about the above approach and am trying to choose between three options:

Approach 1: Web server, rsyslog and a Flume agent on each machine, plus a Flume collector running on the NameNode in the Hadoop cluster to collect the data from the agents and dump it into HDFS.

Approach 2: Web server and rsyslog on the same machine, with a Flume collector (listening on a remote port for events written by rsyslog on the web servers) running on the NameNode in the Hadoop cluster to collect the data and dump it into HDFS.

Approach 3: Web server, rsyslog and a Flume agent on the same machine, with all agents writing directly into HDFS.

Also, we are using Hive, so we are writing directly into partitioned directories. We therefore want an approach that lets us write into hourly partitions.
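From what I understand, the HDFS sink can write into time-bucketed directories using escape sequences in hdfs.path, which is what we had in mind for the hourly Hive partitions (the warehouse path and sink names below are placeholders):

    # Hourly Hive-style partitions via timestamp escape sequences (path is a placeholder)
    a1.sinks.hdfs-out.type = hdfs
    a1.sinks.hdfs-out.hdfs.path = /user/hive/warehouse/weblogs/dt=%Y-%m-%d/hr=%H
    # Use the agent's local time if events don't carry a timestamp header
    a1.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
    # Round timestamps down to the hour so an hour's events land in one partition
    a1.sinks.hdfs-out.hdfs.round = true
    a1.sinks.hdfs-out.hdfs.roundValue = 1
    a1.sinks.hdfs-out.hdfs.roundUnit = hour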

Basically I just want to know whether people have used Flume for similar purposes, whether it is the right and reliable tool, and whether my approach seems sensible.

I hope that's not too vague. Any help would be appreciated.

Having two or more Flume agents write directly to HDFS won't work if you want to combine multiple event streams into a single log file. If you want to merge streams, some agent should do the merging before the data reaches HDFS (so approach #3 is out). As for having Flume on the endpoint hosts, I wouldn't do it (it requires a relatively fat JVM running at all times) unless you need advanced Flume features such as event processing, load balancing and so on before your events reach your central cluster. – Daniel S.
Thanks for your comment. So having dedicated Flume machines is the way to go? Can't we run them on the same machines as the web servers? In our case that would mean three more machines: two for the agents and one for the collector. Also, what do you think about our Flume source approach of PHP writing to rsyslog, which writes to a TCP port that the Flume agent listens on? From what I have read, rsyslog provides good reliability: if the Flume agent crashes and the TCP queue fills, rsyslog (if properly configured) starts dumping the events to a local file and replays them when the Flume agent starts again. – mohit_d
You can absolutely run on the same machines as the web servers, if you don't mind the JVM taking some resources away from your web server daemons (this is the approach that Pangea suggested). If you would rather not dedicate that extra RAM to Flume on the web servers, you should have a syslog agent forward the events to some other machine that can aggregate events from several servers. That other machine can be a dedicated Flume machine, or you can run a Flume daemon on one of your HDFS nodes if the traffic is low. – Daniel S.

1 Answer

1
votes

The typical suggestion for your problem would be a fan-in or converging-flow agent deployment model (Google "flume fan in" for more details). In this model, you would ideally have an agent on each web server. Each of those agents forwards its events to a few aggregator or collector agents. The aggregator agents then forward the events to a final destination agent that writes to HDFS.

This tiered architecture makes scaling, failover, etc. much simpler to manage.
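As a rough illustration only (host names, ports, channel types and paths are placeholders, not a definitive setup), a two-tier fan-in configuration could look something like this:

    # --- Agent on each web server: syslog in, Avro out to the collector ---
    web.sources = syslog-in
    web.channels = ch
    web.sinks = to-collector

    web.sources.syslog-in.type = syslogtcp
    web.sources.syslog-in.host = localhost
    web.sources.syslog-in.port = 5140
    web.sources.syslog-in.channels = ch

    # File channel buffers events on disk if the collector is unreachable
    web.channels.ch.type = file

    web.sinks.to-collector.type = avro
    web.sinks.to-collector.channel = ch
    web.sinks.to-collector.hostname = collector01.example.com
    web.sinks.to-collector.port = 4141

    # --- Collector/aggregator agent: Avro in, HDFS out ---
    coll.sources = from-web
    coll.channels = ch
    coll.sinks = hdfs-out

    coll.sources.from-web.type = avro
    coll.sources.from-web.bind = 0.0.0.0
    coll.sources.from-web.port = 4141
    coll.sources.from-web.channels = ch

    coll.channels.ch.type = file

    coll.sinks.hdfs-out.type = hdfs
    coll.sinks.hdfs-out.channel = ch
    coll.sinks.hdfs-out.hdfs.path = /user/hive/warehouse/weblogs/dt=%Y-%m-%d/hr=%H

If you need redundancy between tiers, you can add a second collector and configure a sink group with a failover or load_balance sink processor on the web-server agents.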