I am trying to ingest files into HDFS using the Flume spooling directory source (SpoolDir > Memory Channel > HDFS).
I am using Cloudera Hadoop 5.4.2 (Hadoop 2.6.0, Flume 1.5.0).
It works well with smaller files, but it fails with larger files. Please find my test scenarios below:
- Files from a few KB up to 50-60 MB are processed without issue.
- For files larger than 50-60 MB, the sink writes around 50 MB to HDFS and then the Flume agent exits unexpectedly.
- There are no error messages in the Flume log. I found that it tries to create the ".tmp" file (on HDFS) several times, and each attempt writes a couple of megabytes (sometimes 2 MB, sometimes 45 MB) before the unexpected exit. After some time, the last attempted ".tmp" file is renamed as completed (the ".tmp" suffix is removed) and the file in the source spoolDir is also renamed with ".COMPLETED", although the full file was never written to HDFS. (The commands I am using to check this are shown just below.)
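For reference, this is roughly how I am checking the behaviour while the agent runs (the paths match the configuration further below; nothing automated, just manual checks):

# list the in-progress ".tmp" / renamed files on the HDFS side
hdfs dfs -ls hdfs://nameservice1/user/etl/temp/spool

# check whether the source file has been renamed to *.COMPLETED locally
ls -l /stage/ETL/spool/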
In the real scenario, our files will be around 2 GB in size, so I need a robust Flume configuration that can handle that workload.
Note:
- The Flume agent node is part of the Hadoop cluster but is not a datanode (it is an edge node).
- The spool directory is on the local filesystem of the same server that runs the Flume agent.
- All servers are physical (not virtual).
- In the same cluster, we have a Twitter data feed running fine with Flume (although with a very small amount of data).
Please find below the flume.conf file I am using:
#############start flume.conf####################
spoolDir.sources = src-1
spoolDir.channels = channel-1
spoolDir.sinks = sink_to_hdfs1

######## source
spoolDir.sources.src-1.type = spooldir
spoolDir.sources.src-1.channels = channel-1
spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
spoolDir.sources.src-1.fileHeader = true
spoolDir.sources.src-1.basenameHeader = true
spoolDir.sources.src-1.batchSize = 100000

######## channel
spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944

######## sink
spoolDir.sinks.sink_to_hdfs1.type = hdfs
spoolDir.sinks.sink_to_hdfs1.channel = channel-1
spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60
#############end flume.conf####################
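For what it is worth, one direction I have been considering but have not tested yet is replacing the memory channel with a file channel and lowering the batch sizes. The sketch below is only an idea; the checkpoint/data directories and the capacity values are placeholders I made up, not something I have validated:

######## untested idea: file channel instead of memory channel
# checkpointDir/dataDirs are placeholder local paths, not existing directories
spoolDir.channels.channel-1.type = file
spoolDir.channels.channel-1.checkpointDir = /stage/ETL/flume/checkpoint
spoolDir.channels.channel-1.dataDirs = /stage/ETL/flume/data
spoolDir.channels.channel-1.capacity = 1000000
spoolDir.channels.channel-1.transactionCapacity = 10000

# matching, smaller batch sizes so a single batch never exceeds transactionCapacity
spoolDir.sources.src-1.batchSize = 10000
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 10000

Would something along these lines be the right direction, or is the channel not the problem here?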
Kindly suggest whether there is any issue with my configuration or whether I am missing something.
Or is it a known issue that the Flume SpoolDir source cannot handle bigger files?
Regards,
-Obaid
- I have posted the same topic to another open community; if I get a solution there, I will update it here, and vice versa.