We have a log collection agent writing to HDFS; that is, the agent (Flume, for example) keeps collecting logs from some applications and writes them to HDFS. The collecting and writing run without a break, so the destination files on HDFS keep growing.
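For context, a minimal sketch of what such an agent's configuration might look like, assuming Flume's standard HDFS sink (the component names, the tailed log path, and the roll interval are placeholders, not our exact setup):

    # Hypothetical Flume agent: tail an application log into HDFS
    agent.sources = appSource
    agent.channels = memChannel
    agent.sinks = hdfsSink

    agent.sources.appSource.type = exec
    agent.sources.appSource.command = tail -F /var/log/app/app.log
    agent.sources.appSource.channels = memChannel

    agent.channels.memChannel.type = memory

    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.channel = memChannel
    agent.sinks.hdfsSink.hdfs.path = hdfs://namenode:9000/data/collect
    agent.sinks.hdfsSink.hdfs.fileType = DataStream
    # Roll to a new file every 5 minutes, so the directory keeps gaining files
    agent.sinks.hdfsSink.hdfs.rollInterval = 300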
And here is the question: since the input data is changing continuously, what would happen to a MapReduce job if I set the collection agent's destination path as the job's input path?
FileInputFormat.addInputPath(job, new Path("hdfs://namenode:9000/data/collect"));
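The rest of the driver is the usual boilerplate around that call; a minimal sketch, assuming the org.apache.hadoop.mapreduce API (the mapper and the output path are placeholders):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CollectJob {

        // Placeholder mapper: just passes each log line through unchanged.
        public static class LogMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "collect-job");
            job.setJarByClass(CollectJob.class);
            job.setMapperClass(LogMapper.class);

            // Input is the directory the Flume agent keeps writing into.
            FileInputFormat.addInputPath(job,
                    new Path("hdfs://namenode:9000/data/collect"));
            // Placeholder output path.
            FileOutputFormat.setOutputPath(job,
                    new Path("hdfs://namenode:9000/data/out"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }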
Will the map function keep running as the input files increase, or will the job just take a snapshot of the input data and do its map/reduce work over that fixed set? - Yohn

Do you mean that HDFS cannot be written to while some process is reading it? Or do you agree with the snapshot thing? - Yohn