I'm developing Hadoop Streaming MapReduce jobs in Perl to process a large set of logs. New files are continually added to the data directory, which already contains 65,000 files.
Currently I run `ls` on the directory and keep track of which files I have already processed, but even the `ls` takes a long time. I need to process the files in as close to real time as possible.
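
For reference, my tracking loop looks roughly like the sketch below. The directory and state-file paths are placeholders, and this assumes the logs sit on a local or NFS-mounted directory (if they lived in HDFS, the listing would go through `hadoop fs -ls` instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder paths -- not the real layout.
my $data_dir   = '/data/logs';
my $state_file = '/var/tmp/processed_logs.txt';

# Load the set of files already handled on a previous run.
my %processed;
if (open my $fh, '<', $state_file) {
    chomp(my @seen = <$fh>);
    @processed{@seen} = ();
    close $fh;
}

# List the directory (the slow part -- equivalent to the `ls`)
# and keep only regular files we haven't seen before.
opendir my $dh, $data_dir or die "Cannot open $data_dir: $!";
my @new_files = grep { -f "$data_dir/$_" && !exists $processed{$_} } readdir $dh;
closedir $dh;

open my $out, '>>', $state_file or die "Cannot append to $state_file: $!";
for my $file (@new_files) {
    # ... launch the streaming job for "$data_dir/$file" here ...
    print {$out} "$file\n";    # record it so the next run skips it
}
close $out;
```
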
Using `ls` to keep track seems less than optimal. Are there any tools or methods for keeping track of which logs have not yet been processed in a directory this large?