I am looking for a data store that serves the following needs:
- distributed because we have lots of data to query (in TBs)
- Write-intensive. Data will be generated by our services, and we want to store it to perform analytics on it.
- We want the analytical queries to be reasonably fast (order of minutes, not hours)
- Most of our queries would be of the "Select, Filter, Aggregate, Sort" type (see the sketch after this list).
- The schema changes often, since what we store will evolve with the changing requirements of the system.
- Part of the data that we store may also be used for pure large-scale map/reduce jobs for other purposes.
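To make the query shape concrete, here is a rough sketch in Java of the kind of question we ask (the `Event` record, field names, and values are made up for illustration): "for one service, count events per error code, most frequent first."

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class QueryShape {
    // Hypothetical record; our real rows have more (and changing) fields.
    record Event(String service, String errorCode, long timestampMs) {}

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("checkout", "E42", 1_000L),
                new Event("checkout", "E42", 2_000L),
                new Event("checkout", "E17", 3_000L),
                new Event("search",   "E42", 4_000L));

        // Select + Filter: events from one service.
        // Aggregate: count per error code.
        Map<String, Long> countsByError = events.stream()
                .filter(e -> e.service().equals("checkout"))
                .collect(Collectors.groupingBy(Event::errorCode, Collectors.counting()));

        // Sort: most frequent error codes first.
        countsByError.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(entry -> System.out.println(entry.getKey() + " -> " + entry.getValue()));
    }
}
```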
Key-value stores are scalable but do not support our query requirements.
Map/Reduce jobs are scalable and can express the queries, but I think they will not meet our query latency requirements.
An RDBMS (like MySQL) would satisfy our query needs, but it would force us into a fixed schema. We could scale it, but then we would have to do sharding etc.
Commercial products like Vertica seem to solve all of our problems, but I would like to avoid a commercial solution if I can.
HBase seems to be as scalable as Hadoop because of the underlying HDFS, and it appears to have the facilities to perform filters and aggregations, but I am not sure about the performance of filter queries in HBase.
Currently, HBase does not support secondary indexes, which makes me wonder whether HBase is the right option for filtering on an arbitrary column. As per the documentation, filtering on the row key and column family is faster than filtering on just the column qualifier. However, I have also read that keeping the Bloom filter indexed on row key and column significantly increases the size of the Bloom filter and makes that option practically infeasible.
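For concreteness, this is roughly the kind of filter-on-an-arbitrary-column scan I have in mind, sketched with the HBase Java client (the table name, column family and qualifier are hypothetical). As I understand it, without a secondary index a `SingleColumnValueFilter` like this is evaluated against every row in the scanned range rather than doing an indexed lookup:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

            // Filter on an arbitrary column qualifier ("service" in family "d").
            // There is no index on this column, so the filter is applied
            // server-side to every row in the scan, unlike a row-key lookup.
            SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("d"),
                    Bytes.toBytes("service"),
                    CompareFilter.CompareOp.EQUAL,
                    Bytes.toBytes("checkout"));
            filter.setFilterIfMissing(true);

            Scan scan = new Scan();
            scan.setFilter(filter);

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}
```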
I have not been able to find much data online about the performance of filter queries in HBase, so I am hoping to find more information here.
Thanks!