I am building an application that includes a feature to bulk tag millions of records, more or less interactively. The interaction is very similar to Gmail, where users can tag individual emails or bulk tag large numbers of emails. I also need quick read access to these tag memberships, and the read pattern is more or less random.
Right now we're using MySQL and inserting one row for every tag-document pair. Writing millions of rows to MySQL takes a while (high I/O), even with bulk insertions and heavy optimization. We need this to be an interactive process, not a batch process.
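For reference, the one-row-per-pair approach looks roughly like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs on its own; the table name `doc_tags` and the helpers `make_db`, `bulk_tag`, and `has_tag` are made up for illustration and are not our real schema.

```python
import sqlite3
from itertools import islice


def make_db():
    # In-memory SQLite is only a stand-in for the real MySQL database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_tags ("
        " tag_id INTEGER NOT NULL,"
        " doc_id INTEGER NOT NULL,"
        " PRIMARY KEY (tag_id, doc_id))"
    )
    return conn


def bulk_tag(conn, tag_id, doc_ids, batch_size=10_000):
    """Insert one row per (tag, document) pair, in batches, in one transaction."""
    it = iter(doc_ids)
    submitted = 0
    with conn:  # commits on success, rolls back on error
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            conn.executemany(
                "INSERT OR IGNORE INTO doc_tags (tag_id, doc_id) VALUES (?, ?)",
                [(tag_id, doc_id) for doc_id in batch],
            )
            submitted += len(batch)
    return submitted  # pairs submitted, not necessarily newly inserted


def has_tag(conn, tag_id, doc_id):
    """Random-access membership check served by the primary key index."""
    row = conn.execute(
        "SELECT 1 FROM doc_tags WHERE tag_id = ? AND doc_id = ?",
        (tag_id, doc_id),
    ).fetchone()
    return row is not None
```

Even with batching like this, every pair still becomes a row plus an index entry, which is where the I/O cost comes from at the scale of millions of documents per tag.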
For the data that we're storing and reading, consistency and availability of the data are not as important as performance and scalability. So in the event of system failure while the writes are occurring, I can deal with some data loss. However, the data definitely needs to be persisted to secondary storage at some point.
So, to sum up, here are the requirements:
- Low-latency bulk writes of potentially tens of millions of records
- Data needs to be persisted to secondary storage eventually
- Low-latency random reads
- Durable writes not required
- Eventual consistency is okay
Here are some solutions I've looked at:
- Write-behind caches (Terracotta, GigaSpaces, Coherence), where records are written to memory and drained to the database asynchronously. These scare me a little because they appear to add a certain amount of complexity to the app that I'd want to avoid.
- Highly scalable key-value and other NoSQL stores, like MongoDB, HBase, and Tokyo Tyrant
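To make the write-behind idea concrete, here is a minimal sketch of what I understand the pattern to be: writes land in memory and return immediately, and a background thread drains them to the database later. This is not how Terracotta, GigaSpaces, or Coherence actually implement it; the class `WriteBehindTagStore` and the `persist` callback are hypothetical names used only for illustration.

```python
import queue
import threading
from collections import defaultdict


class WriteBehindTagStore:
    def __init__(self, persist):
        # persist is a hypothetical callable (tag_id, doc_ids) -> None that
        # writes one batch to the real database.
        self._persist = persist
        self._members = defaultdict(set)  # tag_id -> set of doc_ids, for reads
        self._lock = threading.Lock()
        self._pending = queue.Queue()     # batches not yet persisted
        self.errors = []                  # failed batches are recorded, not retried
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def bulk_tag(self, tag_id, doc_ids):
        """Apply the tag in memory and queue the batch for later persistence."""
        doc_ids = list(doc_ids)
        with self._lock:
            self._members[tag_id].update(doc_ids)
        self._pending.put((tag_id, doc_ids))

    def has_tag(self, tag_id, doc_id):
        """Random read served entirely from memory."""
        with self._lock:
            return doc_id in self._members.get(tag_id, ())

    def _drain(self):
        # Background loop: persist queued batches one at a time.
        while True:
            item = self._pending.get()
            try:
                if item is None:  # sentinel from close()
                    return
                self._persist(*item)
            except Exception as exc:
                self.errors.append(exc)  # a real system would log and retry
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every queued batch has been handed to persist."""
        self._pending.join()

    def close(self):
        self._pending.put(None)
        self._worker.join()
```

Anything still sitting in the queue is lost if the process dies, which is acceptable under the "durable writes not required" requirement above. It also shows the complexity I'm worried about: the in-memory copy and the database can disagree until the queue drains.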