
I have a Flume agent that streams data into an HDFS sink (appending to the same file), and I can "cat" the file and see the data from HDFS. However, the MapReduce job only picks up the first batch that was flushed (batchSize = 100). The rest is not picked up, even though I can cat the file and see it. When I run the MapReduce job after the file is rolled (closed), it picks up all the data. Do you know why the MR job fails to find the rest of the batches even though they exist?


1 Answer


To my knowledge, Flume (1.4 in my case) is not really appending to HDFS files at all. When the HDFS sink starts, it creates a .tmp file that reports 0 kb until it is rolled/renamed. The incremental records are not yet on HDFS but are still in the Flume agent's channel. So until the rename event (.tmp to the final filename) you will not have access to the newly arrived data. (MapReduce's FileInputFormat filters out all files whose names begin with '_' or '.'; see the sketch below.)
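As an illustration, Hadoop's FileInputFormat applies a hidden-file filter by default that behaves roughly like the following sketch (the class name here is my own; it is a paraphrase of the built-in behavior, not code from the question):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Rough equivalent of the hidden-file filter FileInputFormat applies by
// default: any file whose name starts with '_' or '.' (e.g. _SUCCESS
// markers or dot-prefixed in-progress files) is skipped when the job
// lists its input files.
public class HiddenFileFilterSketch implements PathFilter {
    @Override
    public boolean accept(Path path) {
        String name = path.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
}
```

If you really needed a job to see additional files, you could plug in your own filter with FileInputFormat.setInputPathFilter(job, YourFilter.class), but reading a file that Flume is still writing is usually not what you want; it is simpler to let the sink roll the file (via its roll settings) and process only closed files.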