0
votes

My data set has a requirement of having dynamically adding more fields for specific set of records which are not required for other records.

HBase/ No Sql systems offer me the flexibility of using dynamic schemas. But I also have the requirement of storing the data in a normalized fashion to conserve more disk space. Mostly my operations on the data store will be parallel writes and unique key based reads. I also have a requirement of dumping out the entire data set every day. The writes are not transactional in nature and the data store should be able to support aggregation for analytics.

The records count is in the order of 1Million to 10 Million with total size from 500 MB to 10GB. We would also like to open up the data for easy querying by using SQL as query language.

With these requirements, I find part of them are satisfied by No SQL based systems whereas others are satisfied by SQL based systems.

With these requirements, my concerns are,

1. Dynamic schema: Dynamic schema is well supported by No SQL based system. That doesn't mean that it cannot be supported by SQL based system. It is still possible to normalize the dynamic data into another table and use an unique identifier to refer that from the parent table reducing the number of Null entries in the parent table.

2. SQL based query language: RDBMS is meant for this. But still No SQL systems like hbase have Hive, Phoenix to resolve this. Is there any clear winner in this case?

3. Analytics support: NO SQL systems are well suited for OLAP based operations. RDBMS systems are well suited for OLTP based operations. Here the requirement is to perform more OLAP operations. What is missing in RDBMS systems that it is not efficient to perform OLAP operations?

4. Data size As mentioned before, the data size is relatively small and can fit into a single machine. Here is RDBMS based systems like My sql a winner because of easy set up compared to their No SQL counterparts? Is there any other reason for preferring RDBMS in this case other than easy set up?

5. Support for storing binary data Apart from these above requirements, we would also like to store byte objects in the data store. RDBMS based systems offer blobs to support this. But time and again I happen to read that, Blobs are not efficient for frequent changes in data, the overhead of doing cleanup of blobs is more, etc. So in this case, how does No SQL based systems like hbase handle it better than their SQL counterparts?

6. Dumping out the entire data set every day once In both RDBMS based system and No SQL based system, we are trying to save more space and we are not concerned with dumping operation taking few extra minutes to do denormalization. In this case, what would be the winner?

Based on these main factors for the current use case, I find bit difficult to pick a clear winner between RDBMS based system vs No SQL based system. Your inputs on these plz..

Thanks, Sriram

1

1 Answers

1
votes

That's not a question. It is a set of questions. To answer all those questions, I would need to write a book, which I do not intend to do. As a result, I will sum up my thoughts. Let's suppose you are using RDBMS and you have a table foo which might have some custom fields. The RDBMS solution to support them without adding those columns to foo is to create a table foo_custom_fields(id, name) to hold all the possible custom fields and have a foo_custom_field_values(id, foo_custom_field_id, foo_id, value). So, you can achieve roughly the same with RDBMS, as with NoSQL. NoSQL is very flexible, which is its advantage. There are more pros and cons for both cases, but I believe that:

if you need to write SQL, then do not use NoSQL.