I am trying to ingest files into HDFS using the Flume spooling directory source (SpoolDir > Memory Channel > HDFS).
I am using Cloudera Hadoop 5.4.2 (Hadoop 2.6.0, Flume 1.5.0).
It works well with smaller files, but it fails with larger files. Please find my test scenarios below:
- Files from a few KB up to 50-60 MB are processed without issue.
- For files larger than 50-60 MB, the sink writes around 50 MB to HDFS and then the Flume agent exits unexpectedly.
- There are no error messages in the Flume log. I found that it tries to create the ".tmp" file (on HDFS) several times, and each attempt writes a couple of megabytes (sometimes 2 MB, sometimes 45 MB) before the unexpected exit. After some time, the last attempted ".tmp" file is renamed as completed (the ".tmp" suffix is removed) and the file in the source spoolDir is also renamed with ".COMPLETED", although the full file was never written to HDFS. (The commands I am using to check this are shown just below.)
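For reference, this is roughly how I am checking the behaviour while the agent runs (the paths match the configuration further below; nothing automated, just manual checks):

# list the in-progress ".tmp" / renamed files on the HDFS side
hdfs dfs -ls hdfs://nameservice1/user/etl/temp/spool

# check whether the source file has been renamed to *.COMPLETED locally
ls -l /stage/ETL/spool/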
In the real scenario, our files will be around 2 GB in size, so I need a robust Flume configuration that can handle that workload.
Note:
- The Flume agent node is part of the Hadoop cluster but is not a datanode (it is an edge node).
- The spool directory is on the local filesystem of the same server that runs the Flume agent.
- All servers are physical (not virtual).
- In the same cluster, we have a Twitter data feed running fine with Flume (although with a very small amount of data).
Please find below the flume.conf file I am using:
#############start flume.conf####################
spoolDir.sources = src-1
spoolDir.channels = channel-1
spoolDir.sinks = sink_to_hdfs1

######## source
spoolDir.sources.src-1.type = spooldir
spoolDir.sources.src-1.channels = channel-1
spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
spoolDir.sources.src-1.fileHeader = true
spoolDir.sources.src-1.basenameHeader = true
spoolDir.sources.src-1.batchSize = 100000

######## channel
spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944

######## sink
spoolDir.sinks.sink_to_hdfs1.type = hdfs
spoolDir.sinks.sink_to_hdfs1.channel = channel-1
spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60
#############end flume.conf####################
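For what it is worth, one direction I have been considering but have not tested yet is replacing the memory channel with a file channel and lowering the batch sizes. The sketch below is only an idea; the checkpoint/data directories and the capacity values are placeholders I made up, not something I have validated:

######## untested idea: file channel instead of memory channel
# checkpointDir/dataDirs are placeholder local paths, not existing directories
spoolDir.channels.channel-1.type = file
spoolDir.channels.channel-1.checkpointDir = /stage/ETL/flume/checkpoint
spoolDir.channels.channel-1.dataDirs = /stage/ETL/flume/data
spoolDir.channels.channel-1.capacity = 1000000
spoolDir.channels.channel-1.transactionCapacity = 10000

# matching, smaller batch sizes so a single batch never exceeds transactionCapacity
spoolDir.sources.src-1.batchSize = 10000
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 10000

Would something along these lines be the right direction, or is the channel not the problem here?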
Kindly suggest whether there is any issue with my configuration or whether I am missing something.
Or is it a known issue that the Flume SpoolDir source cannot handle bigger files?
Regards,
-Obaid
- I have posted the same topic to another open community; if I get a solution there, I will update it here, and vice versa.