1 vote

We have a log collection agent writing to HDFS; that is, the agent (Flume, for example) keeps collecting logs from some applications and writing them to HDFS. The reading and writing run without a break, so the destination files on HDFS keep growing.

And here is the question: since the input data is changing continuously, what would happen to a MapReduce job if I set the collection agent's destination path as the job's input path?

FileInputFormat.addInputPath(job, new Path("hdfs://namenode:9000/data/collect"));
Well, I guess you will run out of disk space some day? What do you want to hear? :D - Thomas Jungblut
Sorry, I did not make it clear. I mean that if the input data keeps changing while an MR job like word count is running, what happens to the map function: does it keep running over the growing set of files, or does it take a snapshot of the input data and do the map/reduce work on that fixed set? - Yohn
The input is always immutable; the job runs on blocks that are determined at the start of the job. They can never change, so there are no issues for the job or the map function. - Thomas Jungblut
@ThomasJungblut - so my understanding is that new data which arrives at the input location after the MR job has started is not included in the processing, right? - sras
Thanks @ThomasJungblut, but I'm still confused by the word "immutable". Do you mean that HDFS cannot be written to while some process is reading it, or do you agree with the snapshot idea? - Yohn

1 Answer

1 vote

A MapReduce job processes only the data that is available when the job starts: the input splits are computed from the files present under the input path at submission time, and files the agent writes afterwards are simply not part of that job.
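To make that concrete, here is a minimal word-count driver sketch (standard Hadoop MapReduce API; the output path is hypothetical, the input path is the one from your question). The splits are fixed when `waitForCompletion` submits the job, so anything Flume writes to `/data/collect` after that point is ignored by this run:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The directory the Flume agent keeps writing into (from the question).
            FileInputFormat.addInputPath(job, new Path("hdfs://namenode:9000/data/collect"));
            // Hypothetical output path for this run.
            FileOutputFormat.setOutputPath(job, new Path("hdfs://namenode:9000/output/wordcount"));

            // Input splits are computed here, from the files that exist right now;
            // files written to /data/collect after submission are not processed.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

To pick up newer data you would simply submit another job later (typically over a new output path).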

MapReduce is meant for batch data processing. For continuous data processing, use tools like Storm or Spark Streaming instead.
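For example, a Spark Streaming job can watch the same directory and count words over files as they arrive. This is only a sketch using the DStream file-stream source; the application name and the 30-second micro-batch interval are arbitrary choices, not anything from your setup:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("streaming-word-count");
            // One micro-batch every 30 seconds (interval chosen arbitrarily).
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

            // Picks up new files that appear under the directory after the stream starts.
            JavaDStream<String> lines =
                    jssc.textFileStream("hdfs://namenode:9000/data/collect");

            JavaPairDStream<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // Print a sample of each batch's counts to the driver log.
            counts.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }

Note that `textFileStream` only sees files created in the directory after the stream starts, so it fits the "agent keeps adding files" pattern, whereas a plain MapReduce job fits periodic batch runs over whatever has accumulated so far.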