We are overwhelmed nowadays by the number of NoSQL options, and by NoSQL in general. It is trendy to abandon or ignore RDBMSs and adopt NoSQL "blindly", even though most startups and projects can do quite well with a traditional RDBMS.

Let's start with the NoSQL definition:

NoSQL DEFINITION: Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. (Plus schema-free, which in practice means an implicit schema and is much worse than an explicit one, and eventual consistency.)

NoSQL (or at least the concepts for handling big data) was created by companies such as Google (BigTable), Amazon (Dynamo), Twitter and Facebook. Cassandra and Riak were born from that work. It seems only MongoDB was developed on its own, without being influenced by the papers published by Google and Amazon.

But the vast majority of companies don't operate at such a scale, and an RDBMS might be a good fit. I could not find the exact amount of data MySQL or PostgreSQL can handle with reasonable performance (at least the PostgreSQL FAQ says there are 32 TB databases in existence).

And we can still scale with an RDBMS. Sharding at the application level is fairly easy (although rebalancing shards is more challenging and could become an issue). We can also use replication to scale reads (writing only to the master), but then we have to deal with distributed-systems challenges: replication delay and eventual consistency. We could restrict this to the subset of the data (a few tables, for example) where replication delay is not a big issue.
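To make the sharding and read-scaling points concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `shard_for`, `moved_fraction` and `Router` names, the choice of MD5 with modulo placement, the round-robin replica choice); it is not taken from any particular library. It shows why the routing itself is easy, and why rebalancing is the painful part: with plain modulo placement, going from 4 to 5 shards relocates most keys.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process, so it is not usable here.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
    return moved / len(keys)


class Router:
    """Send writes to the master and reads to replicas (round-robin)."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        self._next = 0

    def for_write(self):
        return self.master

    def for_read(self, allow_stale: bool = True):
        # Reads that cannot tolerate replication delay go to the master.
        if not allow_stale or not self.replicas:
            return self.master
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica


keys = [f"user-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
print(f"keys relocated when going from 4 to 5 shards: {fraction:.0%}")
```

The `allow_stale` flag is the application-level version of "only use replicas for tables where replication delay doesn't matter". Schemes such as consistent hashing or a lookup table of key ranges reduce the amount of data moved on rebalancing, at the cost of more moving parts.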

For even better performance, caching can be introduced (Redis or memcached).
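The usual way to put a cache in front of the database is the cache-aside pattern: look in the cache first, fall back to the database on a miss, then store the result with an expiry. The sketch below uses a small in-process `TTLCache` class as a stand-in for Redis or memcached; that class, `get_user` and the `load_from_db` callback are all hypothetical names made up for illustration, not a real client API.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for an external cache such as Redis or memcached."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Entry is too old: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def delete(self, key):
        self._store.pop(key, None)


def get_user(user_id, cache, load_from_db):
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    row = load_from_db(user_id)
    if row is not None:
        cache.set(key, row)
    return row
```

On writes, the application would update the database and then call `cache.delete(key)`, so the next read repopulates the entry. Until the TTL expires or the key is deleted, readers may see stale data, which is the same trade-off as reading from a replica.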

And you should plan your queries ahead if possible, to get the best possible performance from the RDBMS, and build your API on top of them, not vice versa.

And, of course, there is no substitute for ACID in the NoSQL world, and when you need it, it is much simpler to use an RDBMS than to try to reinvent ACID on top of NoSQL (which, given the trade-offs described by the CAP theorem, is very hard if not impossible). A nice summary of PostgreSQL usage and scaling by Braintree: Scaling PostgreSQL

One more option with an RDBMS: it is common to split "real-time" tables from report tables, which can have a different (flatter) structure for faster queries, or to create separate tables/views designed for fast reads (agreed, this adds complexity, but at least the options exist).
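As an illustration of the report-table idea, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The schema (`customers`, `orders`, `report_sales_by_country`) and the `refresh_report` helper are invented for the example. The normalized tables take the "real-time" writes, and a flat table is rebuilt periodically so that the report query is a plain read with no joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized "real-time" tables that the application writes to.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount_cents INTEGER NOT NULL
    );
    -- Flat, denormalized table that exists only to serve the report.
    CREATE TABLE report_sales_by_country (
        country TEXT PRIMARY KEY,
        order_count INTEGER NOT NULL,
        total_cents INTEGER NOT NULL
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "IT"), (2, "IT"), (3, "DE")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 1000), (2, 2, 2500), (3, 3, 700), (4, 1, 300)])


def refresh_report(conn):
    """Rebuild the report table from the live tables in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM report_sales_by_country")
        conn.execute("""
            INSERT INTO report_sales_by_country (country, order_count, total_cents)
            SELECT c.country, COUNT(*), SUM(o.amount_cents)
            FROM orders o JOIN customers c ON c.id = o.customer_id
            GROUP BY c.country
        """)


refresh_report(conn)
rows = conn.execute(
    "SELECT country, order_count, total_cents "
    "FROM report_sales_by_country ORDER BY country"
).fetchall()
print(rows)
```

The cost is the one mentioned above: the report is only as fresh as the last refresh, and someone has to schedule and monitor that job. In PostgreSQL a materialized view can play the same role.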

So, what are the use cases that favor NoSQL over an RDBMS, and what is the limit of an RDBMS beyond which NoSQL becomes the more appropriate solution? What questions should system architects ask before choosing NoSQL?

I do believe in simplicity (although simple is not easy), and NoSQL is not as simple as it might sound (there is no free lunch). Developers already have a long history of RDBMS expertise, and RDBMSs are more mature products in general. With NoSQL you get your own set of distributed challenges, not to mention more operational work to properly configure and monitor the cluster.

1 Answer

It's pretty hard to answer this question because "NoSQL", unlike "RDBMS", says nothing by itself -- using NoSQL doesn't mean anything until you say which product you're going to use. Imagine you have to develop a NoSQL implementation of SVN and you choose Cassandra -- now you have to implement your own file versioning, handling in every commit the fact that somewhere in the past there may be one or more columns holding previous versions of the file, and you should be able to show the file history easily. After a while exploring the NoSQL world you discover HBase, which is "similar" to Cassandra but offers column versioning for free. D'oh!

So the first point is that the NoSQL product has to be chosen based on the specific needs of the application. Don't use a screwdriver to drive a nail.

The following are personal opinions, based on my choice of Cassandra for adding company ratings and reviews (and other things) to a very high-traffic website.

  • performance over consistency

I'm handling users' comments about companies, so consistency is not a real problem. If a comment is not visible immediately after it has been published, nobody will complain. I'm not overbooking a flight because of a stale read. On the contrary, since the site gets millions of hits, queries have to be very fast.

  • no single point of failure

Comments and users, once integrated, appear on every page of the website, from the home page to the company detail page. I couldn't afford to bring down the whole website because of a DB problem. I don't work for Datastax, so believe it or not: in more than 4 years we have not had any downtime (touch wood) -- the product was chosen because it advertised "no single point of failure" (luckily it's true!)

  • query-driven-design (O(1) 'complex' queries)

Before starting to model the data I already knew the exact queries I had to run,
so queries like

SELECT * FROM comments where city='ROME' and vote=3 and userid='abc' ORDER BY timestamp DESC LIMIT 100

perform very fast, because the data is stored precisely so that it can be retrieved by that specific query (that's why in the NoSQL world you often hear "1 table = 1 query").
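To show what "stored just for the query" means, here is a toy in-memory model in Python. It is not how Cassandra works internally, and the `CommentsByCityVoteUser` class and its method names are made up. The assumption is that the table behind the query above uses (city, vote, userid) as its partition key and keeps rows ordered by timestamp, newest first, so the query becomes one lookup plus reading the first 100 rows.

```python
import bisect
from collections import defaultdict


class CommentsByCityVoteUser:
    """Toy model of a table designed for exactly one query shape."""

    def __init__(self):
        # partition key (city, vote, userid) -> rows kept sorted, newest first
        self._partitions = defaultdict(list)

    def insert(self, city, vote, userid, timestamp, text):
        partition = self._partitions[(city, vote, userid)]
        # Store the negated timestamp so ascending order means newest first.
        bisect.insort(partition, (-timestamp, text))

    def latest(self, city, vote, userid, limit=100):
        """Equivalent of: WHERE city=? AND vote=? AND userid=?
        ORDER BY timestamp DESC LIMIT ?"""
        partition = self._partitions.get((city, vote, userid), [])
        return [(-neg_ts, text) for neg_ts, text in partition[:limit]]


table = CommentsByCityVoteUser()
table.insert("ROME", 3, "abc", 100, "ok place")
table.insert("ROME", 3, "abc", 300, "went back, still ok")
table.insert("ROME", 3, "abc", 200, "average")
table.insert("ROME", 5, "abc", 400, "great")  # different partition
print(table.latest("ROME", 3, "abc", limit=2))
```

The sorting cost is paid at write time, and a different query (say, all comments by `userid` regardless of city) cannot be served from this structure: it needs its own table holding a second copy of the data.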

Cheers, Carlo