I've been working with CouchDB for a while, and I'm considering doing a little academic project in HBase / Hadoop. I've read some material on them, but could not find a good answer to one question:
Both Hadoop/HBase and CouchDB use MapReduce as their main query method. However, there is a significant difference: CouchDB does this incrementally, using views, indexing each new piece of data as it is added to the database, while Hadoop (in all the examples I've seen) is typically used to perform full queries over entire data sets. What I'm missing is the ability to use Hadoop MapReduce to build and, more importantly, maintain indexes such as CouchDB's views. I've seen examples of how MapReduce can be used to create an initial index, but nothing about incremental updates.
I believe the main challenge here is to run the indexing job only on rows that have changed since a given timestamp (the time of the last indexing job). That would keep these jobs short, allowing them to run frequently and keep the index relatively up to date.
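To make the idea concrete, here is a minimal, HBase-agnostic sketch in Python of what I have in mind (all names are made up; a dict stands in for the table, and a logical clock stands in for row timestamps — in HBase this would presumably be a Scan restricted to a time range):

```python
# Toy model of incremental indexing: re-index only rows whose
# timestamp is newer than the previous indexing job's run time.
table = {}    # row_key -> (value, last_modified "timestamp")
index = {}    # index key -> set of row keys
clock = 0     # logical clock standing in for real timestamps
last_run = 0  # "timestamp" of the previous indexing job

def put(row_key, value):
    """Write a row, recording when it was last modified."""
    global clock
    clock += 1
    table[row_key] = (value, clock)

def index_key(value):
    """Arbitrary, user-defined mapping from row contents to an index key
    (here: first letter of the value)."""
    return value[0]

def run_incremental_index():
    """Index only rows changed since the last run, then advance last_run."""
    global last_run
    for row_key, (value, ts) in table.items():
        if ts > last_run:
            index.setdefault(index_key(value), set()).add(row_key)
    last_run = clock

put("r1", "apple")
put("r2", "avocado")
run_incremental_index()        # indexes r1 and r2
put("r3", "banana")
run_incremental_index()        # touches only r3
```

The filtering step is the part I don't know how to express well in Hadoop: the job would need to scan only the rows modified in the `(last_run, now]` window rather than the whole table.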
I expected this usage pattern to be very common, and was surprised not to find anything about it online. I've already looked at IndexedHbase and HbaseIndexed, which both provide secondary indexing on HBase over columns other than the row key. That's not what I need. I need the programmatic ability to define an index arbitrarily, based on the contents of one or more rows.
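By "define the index arbitrarily" I mean something like a CouchDB-style map function, which can emit any number of computed (key, value) pairs per row rather than indexing a stored column value. A rough sketch of that contract (names are illustrative, not any real API):

```python
def view_map(row_key, doc):
    """CouchDB-style map function: emit arbitrary (key, value) pairs
    computed from the row contents, e.g. one entry per word."""
    for word in doc.split():
        yield (word.lower(), row_key)

def build_view(rows):
    """Full (non-incremental) build: apply view_map to every row
    and group the emitted values by key."""
    view = {}
    for row_key, doc in rows.items():
        for key, value in view_map(row_key, doc):
            view.setdefault(key, []).append(value)
    return view

rows = {"r1": "Hello world", "r2": "hello HBase"}
view = build_view(rows)
# view["hello"] == ["r1", "r2"]
```

A column-based secondary index can't express this kind of computed, multi-entry mapping, which is why the existing HBase indexing projects don't fit.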