I've been working with CouchDB for a while, and I'm considering doing a little academic project in HBase / Hadoop. I've read some material on them, but could not find a good answer to one question:
Both Hadoop/HBase and CouchDB use MapReduce as their main query method. However, there is a significant difference: CouchDB does this incrementally, using views, indexing each new piece of data as it is added to the database, while Hadoop (in all the examples I've seen) is typically used to perform full queries over entire data sets. What I'm missing is the ability to use Hadoop MapReduce to build and, more importantly, maintain indexes such as CouchDB's views. I've seen examples of how MapReduce can be used to create an initial index, but nothing about incremental updates.
I believe the main challenge here is to run the indexing job only on rows that have changed since a given timestamp (the time of the last indexing job). That would keep these jobs short, allowing them to run frequently and keep the index relatively up to date.
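To make the idea concrete, here is a minimal, HBase-agnostic sketch in Python of what I have in mind (all names are made up; a dict stands in for the table, and a logical clock stands in for row timestamps — in HBase this would presumably be a Scan restricted to a time range):

```python
# Toy model of incremental indexing: re-index only rows whose
# timestamp is newer than the previous indexing job's run time.
table = {}    # row_key -> (value, last_modified "timestamp")
index = {}    # index key -> set of row keys
clock = 0     # logical clock standing in for real timestamps
last_run = 0  # "timestamp" of the previous indexing job

def put(row_key, value):
    """Write a row, recording when it was last modified."""
    global clock
    clock += 1
    table[row_key] = (value, clock)

def index_key(value):
    """Arbitrary, user-defined mapping from row contents to an index key
    (here: first letter of the value)."""
    return value[0]

def run_incremental_index():
    """Index only rows changed since the last run, then advance last_run."""
    global last_run
    for row_key, (value, ts) in table.items():
        if ts > last_run:
            index.setdefault(index_key(value), set()).add(row_key)
    last_run = clock

put("r1", "apple")
put("r2", "avocado")
run_incremental_index()        # indexes r1 and r2
put("r3", "banana")
run_incremental_index()        # touches only r3
```

The filtering step is the part I don't know how to express well in Hadoop: the job would need to scan only the rows modified in the `(last_run, now]` window rather than the whole table.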
I expected this usage pattern to be very common, and was surprised not to find anything about it online. I've already looked at IndexedHbase and HbaseIndexed, which both provide secondary indexing on HBase over columns other than the row key. That's not what I need. I need the programmatic ability to define an index arbitrarily, based on the contents of one or more rows.
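By "define the index arbitrarily" I mean something like a CouchDB-style map function, which can emit any number of computed (key, value) pairs per row rather than indexing a stored column value. A rough sketch of that contract (names are illustrative, not any real API):

```python
def view_map(row_key, doc):
    """CouchDB-style map function: emit arbitrary (key, value) pairs
    computed from the row contents, e.g. one entry per word."""
    for word in doc.split():
        yield (word.lower(), row_key)

def build_view(rows):
    """Full (non-incremental) build: apply view_map to every row
    and group the emitted values by key."""
    view = {}
    for row_key, doc in rows.items():
        for key, value in view_map(row_key, doc):
            view.setdefault(key, []).append(value)
    return view

rows = {"r1": "Hello world", "r2": "hello HBase"}
view = build_view(rows)
# view["hello"] == ["r1", "r2"]
```

A column-based secondary index can't express this kind of computed, multi-entry mapping, which is why the existing HBase indexing projects don't fit.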