I would like to synchronize data to a Hadoop filesystem. This data is intended to be used as input for a scheduled MapReduce job.
This example might explain more:
Lets say I have an input stream of documents which contain a bunch of words, these words are needed as input for a MapReduce WordCount job. So, for each document, all words should be parsed out and uploaded to the filesystem. However, if the same document arrives from the input stream again, I only want the changes to be uploaded (or deleted) from the filesystem.
How should the data be stored; should I use HDFS or HBase? The amount of data is not very large, maybe a couple of GB.
Is it possible to start scheduled MapReduce jobs with input from HDFS and/or HBase?