I wanted to know: does Hadoop MapReduce re-process the entire dataset if the same job is submitted twice? For example, the word count example counts the occurrences of each word in every file in an input folder. If I were to add a file to that folder and re-run the word count MapReduce job, would the initial files be re-read, re-mapped, and re-reduced?
If so, is there a way to configure Hadoop to process ONLY the new files and add their counts to a "summary" produced by previous MapReduce runs?
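To make that concrete, here is a minimal sketch of the kind of setup I'm imagining (the `/input/new` and `/output/run-2` paths are just hypothetical placeholders, and the mapper/reducer are the standard word count ones): point the job only at a directory containing files added since the last run, and somehow merge each run's counts into the running summary afterwards.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IncrementalWordCount {

    // Standard word count mapper: emit (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Standard word count reducer: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "incremental word count");
        job.setJarByClass(IncrementalWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical layout: new files are dropped into /input/new between runs,
        // so the job reads only them rather than the whole input folder.
        FileInputFormat.addInputPath(job, new Path("/input/new"));
        // Each run writes to its own output directory; these per-run counts would
        // then need to be merged with the "summary" from earlier runs somehow.
        FileOutputFormat.setOutputPath(job, new Path("/output/run-2"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The open question for me is the merge step: whether Hadoop offers something built-in for combining the new output with the previous summary, or whether I'd have to run a second job over both.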
Any thoughts/help would be appreciated.