I'm developing Hadoop Streaming MapReduce jobs in Perl to process a large set of logs. New files are continually added to the data directory, which already contains 65,000 files.
Currently I run `ls` on the directory and keep track of which files I have already processed, but even the `ls` takes a long time. I need to process the files in as close to real time as possible.
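
For reference, my tracking loop looks roughly like the sketch below. The directory and state-file paths are placeholders, and this assumes the logs sit on a local or NFS-mounted directory (if they lived in HDFS, the listing would go through `hadoop fs -ls` instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder paths -- not the real layout.
my $data_dir   = '/data/logs';
my $state_file = '/var/tmp/processed_logs.txt';

# Load the set of files already handled on a previous run.
my %processed;
if (open my $fh, '<', $state_file) {
    chomp(my @seen = <$fh>);
    @processed{@seen} = ();
    close $fh;
}

# List the directory (the slow part -- equivalent to the `ls`)
# and keep only regular files we haven't seen before.
opendir my $dh, $data_dir or die "Cannot open $data_dir: $!";
my @new_files = grep { -f "$data_dir/$_" && !exists $processed{$_} } readdir $dh;
closedir $dh;

open my $out, '>>', $state_file or die "Cannot append to $state_file: $!";
for my $file (@new_files) {
    # ... launch the streaming job for "$data_dir/$file" here ...
    print {$out} "$file\n";    # record it so the next run skips it
}
close $out;
```
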
Using `ls` to keep track seems less than optimal. Are there any tools or methods for keeping track of which logs have not yet been processed in a directory this large?