0
votes

I am looking for a data store that serves the following needs:

  1. Distributed, because we have a lot of data to query (in the terabytes).
  2. Write-intensive. Data will be generated by services, and we want to store it so that we can run analytics on it.
  3. Analytical queries should be reasonably fast (on the order of minutes, not hours).
  4. Most of our queries will be of the "Select, Filter, Aggregate, Sort" type.
  5. The schema changes often, because what we store depends on the changing requirements of the system.
  6. Part of the stored data may also be used for pure large-scale map/reduce jobs for other purposes.
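
To make the "Select, Filter, Aggregate, Sort" shape concrete, here is a minimal in-memory sketch in Python. The records, field names and the `avg_latency_by_service` helper are all made up for illustration; the point is only the shape of the query over records whose fields are not fixed (one record carries an extra `region` field).

```python
from collections import defaultdict

# Hypothetical event records; fields may vary from record to record.
events = [
    {"service": "auth", "latency_ms": 120, "status": 200},
    {"service": "auth", "latency_ms": 340, "status": 500},
    {"service": "search", "latency_ms": 80, "status": 200, "region": "eu"},
    {"service": "search", "latency_ms": 95, "status": 200},
    {"service": "billing", "latency_ms": 410, "status": 200},
]

def avg_latency_by_service(records, min_latency_ms=0):
    """Filter by latency, aggregate per service, sort by average (descending)."""
    totals = defaultdict(lambda: [0, 0])  # service -> [sum, count]
    for record in records:
        if record["latency_ms"] >= min_latency_ms:      # filter
            bucket = totals[record["service"]]          # group
            bucket[0] += record["latency_ms"]
            bucket[1] += 1
    averages = {s: total / n for s, (total, n) in totals.items()}  # aggregate
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)  # sort

print(avg_latency_by_service(events, min_latency_ms=90))
```

At the terabyte scale the same four steps would have to run inside the data store, spread across nodes, which is what the requirements above are really asking for.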

Key-value stores are scalable but do not support our query requirements.

Map/Reduce jobs are scalable and can execute the queries, but I think they will not meet our query latency requirements.

An RDBMS (like MySQL) would satisfy our query needs, but it would force us to have a fixed schema. We could scale it, but then we would have to do sharding etc.

Commercial solutions like Vertica seem like a solution that would solve all of our problems, but I would avoid a commercial solution if I can.

HBase seems to be a system that is as scalable as Hadoop because of the underlying HDFS and seems to have the facilities to perform Filters and Aggregations, but I am not sure about the performance of Filter queries in HBase.

Currently HBase does not support secondary indexes. This makes me wonder whether HBase is the right option for filtering on an arbitrary column. As per the documentation, filtering on row ID and column family is faster than filtering on just the column qualifier. However, I also read that having the Bloom filter index on row ID and column family significantly increases the size of the Bloom filter and makes this option practically infeasible.
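
To illustrate the concern, here is a toy Python model. It is not the HBase API; the key format, table contents and function names are invented for illustration. The only assumption is that rows are stored sorted by row key, so a row-key prefix can be answered with a seek plus a short range read, while a filter on a column value without any index has to examine every row.

```python
from bisect import bisect_left

# Toy table: 100 users x 50 timestamps, sorted by row key "userNNN#TTTT".
rows = sorted(
    (
        (f"user{u:03d}#{t:04d}", {"status": "error" if (u + t) % 7 == 0 else "ok"})
        for u in range(100)
        for t in range(50)
    ),
    key=lambda row: row[0],
)
keys = [key for key, _ in rows]

def scan_by_prefix(prefix):
    """Seek to the first key >= prefix, then read until the prefix no longer matches."""
    i = bisect_left(keys, prefix)
    examined, hits = 0, []
    while i < len(rows):
        key, cols = rows[i]
        examined += 1  # the row that ends the scan is counted as well
        if not key.startswith(prefix):
            break
        hits.append((key, cols))
        i += 1
    return hits, examined

def scan_by_column(qualifier, value):
    """No index on column values: every row has to be examined."""
    examined, hits = 0, []
    for key, cols in rows:
        examined += 1
        if cols.get(qualifier) == value:
            hits.append((key, cols))
    return hits, examined

prefix_hits, prefix_examined = scan_by_prefix("user042#")
column_hits, column_examined = scan_by_column("status", "error")
print(len(prefix_hits), prefix_examined)   # rows matched / rows examined for the key scan
print(len(column_hits), column_examined)   # rows matched / rows examined for the column filter
```

In this model the prefix scan touches only the rows for one user plus one terminating row, while the column filter touches all 5000 rows regardless of how many match. Whether real HBase filters behave this badly at scale is exactly what I am trying to find out.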

I am unable to find much data online about the performance of filter queries in HBase. I am hoping to find some more information here.

Thanks!

2
Now that I think about it, it seems that SimpleDB will satisfy all the requirements. It is scalable and supports all the kinds of queries I want. The only limitations I see for SimpleDB are the domain size restriction and the fact that I have to worry about query time limits. - user855
Are you sure you want to use SimpleDB for this? Their docs say "Amazon SimpleDB is designed to store relatively small amounts of data and is optimized for fast data access and flexibility in how that data is expressed." - Suman

2 Answers

0
votes

Try Apache Cassandra; it supports secondary indexes very well. As for HBase Bloom filters, please go through this link, which describes several Bloom filter options depending on the access pattern: HBase Bloom filters

0
votes

You are probably looking for an MPP solution like Postgres-XL or a related platform.