
I am running a Flume agent which uses a memory channel.

agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000000

The source is of type syslogtcp and the sink is of type hdfs. The agent collects about 1 million records every minute.
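
For reference, a minimal sketch of the rest of the configuration (the source/sink names, port, and HDFS path here are illustrative):

agent.sources = s1
agent.sinks = k1
agent.channels = c1

agent.sources.s1.type = syslogtcp
agent.sources.s1.port = 5140
agent.sources.s1.channels = c1

agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://namenode/flume/events
agent.sinks.k1.channel = c1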

My concern is that the Flume agent consumes disk space even though I am using a memory channel. After running for about a month, it uses around 300 GB of disk space, which is causing problems. My questions are:

Q1: Why is disk space consumed when running this agent, given that it uses a memory channel?

Q2: When will this space be released? Are there any conditions, or must it be done manually? Any idea where these files are stored?


1 Answer


How big are the documents? The typical block size in HDFS is 64 MB, sometimes set to 128 MB, so even if you have a 2 KB doc, it still takes a 64 MB block!

You should set the hdfs.batchSize parameter to a high number to batch these events into larger files on HDFS; see the sketch below. Of course, that will also slow the rate at which events are dumped to HDFS (and the jobs downstream), so if you're after real time, this is not ideal. Instead of sinking to HDFS, you might want to sink to HBase, which aggregates smaller events into a big table.
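
Something along these lines on the HDFS sink should help (the sink name and values are illustrative; tune them to your load). The hdfs.roll* settings control when Flume actually closes an HDFS file, so they are usually raised together with the batch size:

# flush this many events per HDFS write
agent.sinks.k1.hdfs.batchSize = 10000
# roll to a new file at ~128 MB instead of the tiny default
agent.sinks.k1.hdfs.rollSize = 134217728
# disable count-based and time-based rolling so file size decides
agent.sinks.k1.hdfs.rollCount = 0
agent.sinks.k1.hdfs.rollInterval = 0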