5
votes

I am working on an application for "Real Time Rendering of Big Data (Spatial Data)". With the help of Spark Streaming + Spark SQL + WebSocket, I am able to render predefined queries on a dashboard. But I want to fetch data with interactive and ad hoc queries.

For that purpose I am trying to implement it with Spark Streaming + Cassandra. These queries require aggregation and filtering over a huge amount of data.

I am new to Cassandra and Spark, so I am confused about which of the approaches below will be better/faster (a sketch of the shared ingestion step follows the list):

  1. Spark Streaming -> Filtering (Spark) -> Save to Cassandra -> Interactive Query -> UI (Dashboard)
  2. Spark Streaming -> Filtering (Spark) -> Save to Cassandra -> Spark SQL -> Interactive Query -> UI (Dashboard)
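
For context, both options share the same ingestion step (Spark Streaming -> filter in Spark -> save to Cassandra). A minimal sketch of what I have in mind for that step; the keyspace, table, field names and socket source are only placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector.streaming._   // adds saveToCassandra on DStreams

    // Placeholder record type; the real schema mirrors PERFORMANCE.GEONAMES
    case class GeoPoint(geonameid: Long, latitude: Double, longitude: Double, quadkey: String)

    val conf = new SparkConf()
      .setAppName("geo-ingest")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder Cassandra node

    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder source: CSV lines "id,lat,lon,quadkey" arriving on a socket
    val points = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(f => GeoPoint(f(0).toLong, f(1).toDouble, f(2).toDouble, f(3)))
      .filter(p => p.latitude >= -90 && p.latitude <= 90)   // basic filtering in Spark

    // "Save to Cassandra" step; keyspace/table names are placeholders
    points.saveToCassandra("performance", "geonames")

    ssc.start()
    ssc.awaitTermination()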

Will Cassandra be fast enough to give results in real time? Or should I create an RDD from Cassandra and perform interactive queries over it?

One of the queries is:

"SELECT *  FROM PERFORMANCE.GEONAMES A  INNER JOIN  
(SELECT max(GEONAMEID) AS MAPINFO_ID FROM  PERFORMANCE.GEONAMES
where longitude between %LL_LONG% and %UR_LONG% 
and latitude between %LL_LAT% and %UR_LAT%  
and %WHERE_CLAUSE% GROUP BY LEFT(QUADKEY, %QUAD_TREE_LEVEL%)  )
AS B ON A.GEONAMEID = B.MAPINFO_ID"

Any input or suggestions would be appreciated. Thanks!

Thanks @doanduyhai for suggesting the SASI secondary index; it really made a huge difference.

2
Performance will very much depend on the design of your tables and how many rows you need to read to answer one single request. If you're optimizing the path for the UI rendering, it would make sense to write the data in Cassandra in a shape that will give you the info you need with minimal reads. I don't think there's a general answer without looking into the specific requirements of the app. - maasg

2 Answers

2
votes

Will Cassandra be fast enough to give results in real time? Or should I create an RDD from Cassandra and perform interactive queries over it?

It depends on how much filtering you're doing up front and on the number of machines in your cluster. If your Cassandra table holds 1 TB of data and your query fetches 100 GB of it into memory, then on a cluster of 10 machines each node has to load about 10 GB; that is manageable, but the query will never be sub-minute.

Now, if you filter enough to fetch only 100 MB in total out of the Cassandra table, that is 10 MB per machine, and latencies on the order of seconds become possible.

How to filter data early in Cassandra?

  1. Use the new SASI secondary index (wait for Cassandra 3.5, released this week, because 2 critical bugs have been discovered); see the sketch after this list
  2. Use DSE Search to filter early with Solr
  3. Use the Stratio Lucene secondary index
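
For option 1, a rough sketch of what creating the SASI indexes could look like, issued here through the connector's session; the keyspace/table/column names are assumptions taken from the query in the question:

    import org.apache.spark.SparkConf
    import com.datastax.spark.connector.cql.CassandraConnector

    val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder

    // One-time DDL: SASI indexes allow server-side range filtering on lat/long,
    // so only rows inside the bounding box ever leave Cassandra.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(
        "CREATE CUSTOM INDEX IF NOT EXISTS geonames_lat_idx " +
        "ON performance.geonames (latitude) " +
        "USING 'org.apache.cassandra.index.sasi.SASIIndex'")
      session.execute(
        "CREATE CUSTOM INDEX IF NOT EXISTS geonames_long_idx " +
        "ON performance.geonames (longitude) " +
        "USING 'org.apache.cassandra.index.sasi.SASIIndex'")
    }

A bounding-box query restricted on those columns can then be served by the indexes instead of a full scan (CQL may still require ALLOW FILTERING when you combine restrictions on several indexed columns).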
1
votes

which will be better/faster

As always, it depends, for example on what kind of queries you want to be able to run. Cassandra imposes some limitations based on how you set up your tables; for instance, you can't use non-indexed columns in WHERE clauses. If you use Spark in between to load the data, you can perform any query that Spark allows.

If you use Spark SQL/DataFrames and perform a query that Cassandra does allow, Spark will push it down to Cassandra and you get performance similar to using Cassandra directly. Otherwise, it has to load all the data from Cassandra and then apply whatever filters you specified, which is obviously much less efficient (see the sketch below).
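
For illustration, a rough sketch of the DataFrame path (Spark 1.6-style API; keyspace, table and column names are assumptions based on the question):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("geo-adhoc")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder node
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val geonames = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "performance", "table" -> "geonames"))
      .load()

    // If 'latitude' is a clustering column (or has a suitable secondary index),
    // this range predicate can be pushed down, so Cassandra only returns matching
    // rows; otherwise the whole table is loaded and Spark filters it afterwards.
    val inBox = geonames
      .filter(geonames("latitude") >= 40.0 && geonames("latitude") <= 41.0)

    inBox.explain()   // inspect the physical plan to see which filters were pushed down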

So overall you could use Spark, but it will also let users run very slow queries. You should therefore still be aware of how you structure the data in Cassandra, to maximize the chance of predicate pushdown and get raw Cassandra performance.