Cassandra + Solr/Hadoop/Spark - Choosing the right tools

Question

I'm currently investigating how to store and analyze enriched time based data with up to 1000 columns per line. At the moment Cassandra together with either Solr, Hadoop or Spark offered by Datastax Enterprise seem to fulfill my requirements on the rough. But the devil is in the detail.

Out of the 1000 columns about 60 are used for real-time-like queries (web-frontend, user sends form and expect quick response). These queries are more or less GROUPBY statements where the number or occurrences are counted.

As Cassandra itself does not provide the required analytical capabilities (no GROUPBY), I'm left these alternatives:

Roughly query via Cassandra and filter the resultset within self-written code
Index the data with Solr and run facet.pivot queries
Use either Hadoop or Spark and run the queries

The first approach seems cumbersome and prone to errors… Solr does have some anayltic features but without multifield grouping I'm stuck with pivots. I don't know whether this is a good or performant approach though… Last but not least there are Hadoop and Spark, the prior known not to be the best for real-time queries, the later pretty new and maybe not production ready.

So which way to go? There is no one-fits-all here, but before I go one way through I'd like to get some feedback. Maybe I'm thinking to complex or my expectations are too high :S

Thanks in advance,

Arman

Hi, I'm just curious to know if you used any special strategy, at the end, please? Thanks. — tarilabs
Unfortunately, no. Partly, because the project changed midway... Since my post here, Solr and Spark received many updates. The Solr way works fine, when the index is intact which is hard... Spark on the other hand should do the job better than Hadoop, but I had not time to check it. — Arman

aleck aleck · Accepted Answer · 2016-03-19T11:52:39

In a place I work now we have a similar set of tech requirements and a solution is Cassandra-Solr-Spark, exactly in that order.

So if a query can be "covered" by Cassandra indices - good, if not - it's covered by Solr. For testing & less often queries - Spark (Scala, no SparkSQL due to old version of it -- it's a bank, everything should be tested and matured, from cognac to software, argh).

Generally I agree with the solution, though sometimes I have a feeling that some client's requests should NOT be taken seriously at all, saving us from loads of weird queries :)

Cassandra + Solr/Hadoop/Spark - Choosing the right tools

2 Answers