I am working on translating an existing time series database system to the MapReduce model using Hadoop. The database system has both historical and real-time processing capabilities. So far, I have been able to translate the batch-processing functionality to Hadoop.
Unfortunately, when it comes to real-time processing, I have run into some conceptual inconsistencies with the MapReduce model.
I can write my own implementation of Hadoop's InputFormat interface that continuously feeds mappers with new data, so that the mappers can process and emit data continuously. However, because no reduce() method is called until all mappers have completed their execution, my computation is bound to get stuck at the mapping stage.
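To illustrate the idea, here is a minimal sketch of such a reader (the StreamingRecordReader class and the pollNextEvent() feed are hypothetical stand-ins for my actual data source; the matching InputFormat that returns this reader is omitted):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical reader that blocks on a live feed instead of an HDFS file.
public class StreamingRecordReader extends RecordReader<LongWritable, Text> {
    private long count = 0;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        // connect to the live time series feed here (details omitted)
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Blocks until the next event arrives; as long as the stream
        // stays open this never returns false.
        String event = pollNextEvent(); // hypothetical blocking call
        if (event == null) return false; // stream closed
        key.set(count++);
        value.set(event);
        return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0.0f; } // unbounded input
    @Override public void close() { }

    private String pollNextEvent() {
        return null; // placeholder for the real feed
    }
}
```

Since nextKeyValue() never returns false while the stream is open, the map task never reaches the end of its input, and the job never transitions to the reduce phase.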
I've seen some posts mentioning mapred.reduce.slowstart.completed.maps, but as I understand it, this only controls when reducers start pulling map output to their local destinations (the shuffle); the actual reduce() method is still called only after all mappers have completed their execution.
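For reference, this is roughly how that property gets set (0.0f is just an example value; in newer Hadoop versions the property is named mapreduce.job.reduce.slowstart.completedmaps):

```java
import org.apache.hadoop.conf.Configuration;

public class SlowstartConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // 0.0f lets reducers begin shuffling as soon as the first map
        // finishes (the default is 0.05, i.e. after 5% of the maps).
        // This only moves map output earlier; reduce() itself is still
        // not invoked until every map task has completed.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.0f);
        return conf;
    }
}
```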
Of course, there is the option of mimicking continuous execution by processing small batches of data over short time intervals as a continuous stream of separate MR jobs (see the sketch below), but this would introduce additional latency, which is not acceptable in my case.
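To make the latency concern concrete, the micro-batch workaround would look roughly like the driver loop below (the paths, the window length, and the job wiring are placeholder assumptions, not my real setup):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MicroBatchDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        while (true) {
            long window = System.currentTimeMillis();
            Job job = Job.getInstance(conf, "window-" + window);
            // Placeholder paths: each iteration processes whatever data
            // landed in the input directory during the last interval.
            // Mapper/reducer classes for the real job would be set here.
            FileInputFormat.addInputPath(job, new Path("/incoming/" + window));
            FileOutputFormat.setOutputPath(job, new Path("/results/" + window));
            job.waitForCompletion(true);
            Thread.sleep(60_000); // 1-minute windows, as an example
        }
    }
}
```

Even ignoring the HDFS writes between windows, each iteration pays the full MR job startup cost (scheduling, JVM spin-up), which typically puts the end-to-end latency at tens of seconds per window.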
I've also considered using Storm or S4, but before moving any further I need to make sure that this really falls outside the scope of Hadoop.
In summary, it looks like people have managed to build real-time applications on Hadoop (such as Impala), or real-time processing solutions on top of it. The question is: how?