
We are exploring Spark for Cassandra in order to overcome limitations of CQL.

We were initially restricted to CQL but ran into a few roadblocks compared with an RDBMS. To name a few:

  1. To compare with > (greater than) or < (less than) on a column, the column must be part of the clustering key. Even if the column is in the clustering key, I still have to provide the partition key to use < or > on it.
  2. We can't check for NULL on any column value.
  3. To query on any column other than the partition key, we have to create an index on that column.
  4. We can't ORDER BY a column that isn't a clustering key.
  5. GROUP BY limitations.
  6. No joins between tables.

I am a newbie with Cassandra and end up revisiting my schema often because of these limitations.

So, similar to what Hive/Pig provide for HDFS, what additional benefits does Spark give over CQL?


1 Answer


CQL is not a replacement for SQL. It is really designed for pulling out values from a few partitions, usually one, and as you pointed out, it does not do any sort of aggregation or grouping and offers only very limited sorting (though Cassandra 3.0 will have UDFs and UDAs).

Here is what Spark offers over CQL:

  • General aggregation and querying via DataFrames and SQL, including JOINs, GROUP BY, ORDER BY, and UDFs
  • Significantly faster queries -- orders of magnitude faster -- if you cache the Cassandra data in memory using sqlContext.cacheTable
  • Integrated machine learning, statistics, graph processing, and virtually any kind of distributed computation you can imagine, using Scala, Java, Python, and R APIs
  • Ability to ETL in and out of Cassandra tables from and to many other data sources - including various HDFS formats, Amazon S3, DBMSes, Mongo, and most other databases today
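
To make the first two points concrete, here is a minimal PySpark sketch, not a definitive setup. It assumes Spark 2.x or later with the spark-cassandra-connector available to the job. The keyspace `shop`, the table `orders`, and the columns `customer_id`, `amount` and `note` are hypothetical names chosen for illustration. It registers a Cassandra table as a temporary view, optionally caches it, and runs a query that combines a range filter, a NULL check, GROUP BY and ORDER BY on non-key columns.

```python
# Sketch only: keyspace, table and column names below are hypothetical.
# Data source name exposed by the spark-cassandra-connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def register_cassandra_table(spark, keyspace, table, cache=False):
    """Load a Cassandra table as a DataFrame and expose it to Spark SQL."""
    df = (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )
    # Make the table queryable by name from spark.sql(...).
    df.createOrReplaceTempView(table)
    if cache:
        # Keep the table in memory; the cache is not refreshed by new writes.
        spark.catalog.cacheTable(table)
    return df


def build_report_sql(table, min_amount):
    """Build a query that plain CQL would reject on non-key columns."""
    min_amount = int(min_amount)  # keep the interpolated value a plain integer
    return (
        "SELECT customer_id, SUM(amount) AS total "
        f"FROM {table} "
        f"WHERE amount > {min_amount} AND note IS NOT NULL "
        "GROUP BY customer_id "
        "ORDER BY total DESC"
    )


def main():
    # Imported here so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("cql-vs-spark-sketch")
        # Assumed contact point; replace with a node of your cluster.
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .getOrCreate()
    )
    register_cassandra_table(spark, "shop", "orders", cache=True)
    spark.sql(build_report_sql("orders", 100)).show()
```

Calling `main()` from a job submitted with the connector on the classpath would run the query. Spark applies whatever the connector cannot push down to Cassandra after reading the data, so this suits analytical scans rather than low-latency lookups.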

Spark is really a completely different beast from CQL. It offers complex analytics over vast quantities of data; CQL doesn't. However, Spark has some limitations as well:

  • Spark is not good at highly concurrent queries. For that, you want to keep queries simple and use CQL to pull out a very small amount of data.
  • Data cached in Spark is not highly available (HA) and is not updated as you write new data into Cassandra (C*).

If you want very fast analytical queries over Cassandra with support for updates and no need to cache, then check out my project http://github.com/tuplejump/FiloDB.