5
votes

I am working on an application for "Real Time Rendering of Big Data (Spatial Data)". With the help of Spark Streaming + Spark SQL + WebSocket, I am able to render predefined queries on a dashboard. But I want to fetch data with interactive and ad hoc queries.

For that purpose I am trying to implement it with Spark Streaming + Cassandra. These queries require aggregation and filtering over a huge amount of data.

I am new to Cassandra and Spark, so I am confused about which of the approaches below will be better/faster (a sketch of the shared ingestion step follows the list):

  1. Spark Streaming -> Filtering (Spark) -> Save to Cassandra -> Interactive Query -> UI (Dashboard)
  2. Spark Streaming -> Filtering (Spark) -> Save to Cassandra -> Spark SQL -> Interactive Query -> UI (Dashboard)
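
For context, both options share the same ingestion step (Spark Streaming -> filter in Spark -> save to Cassandra). A minimal sketch of what I have in mind for that step; the keyspace, table, field names and socket source are only placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector.streaming._   // adds saveToCassandra on DStreams

    // Placeholder record type; the real schema mirrors PERFORMANCE.GEONAMES
    case class GeoPoint(geonameid: Long, latitude: Double, longitude: Double, quadkey: String)

    val conf = new SparkConf()
      .setAppName("geo-ingest")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder Cassandra node

    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder source: CSV lines "id,lat,lon,quadkey" arriving on a socket
    val points = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(f => GeoPoint(f(0).toLong, f(1).toDouble, f(2).toDouble, f(3)))
      .filter(p => p.latitude >= -90 && p.latitude <= 90)   // basic filtering in Spark

    // "Save to Cassandra" step; keyspace/table names are placeholders
    points.saveToCassandra("performance", "geonames")

    ssc.start()
    ssc.awaitTermination()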

Will Cassandra be fast enough to give results in real time? Or should I create an RDD from Cassandra and perform interactive queries over it?

One of the queries is:

"SELECT *  FROM PERFORMANCE.GEONAMES A  INNER JOIN  
(SELECT max(GEONAMEID) AS MAPINFO_ID FROM  PERFORMANCE.GEONAMES
where longitude between %LL_LONG% and %UR_LONG% 
and latitude between %LL_LAT% and %UR_LAT%  
and %WHERE_CLAUSE% GROUP BY LEFT(QUADKEY, %QUAD_TREE_LEVEL%)  )
AS B ON A.GEONAMEID = B.MAPINFO_ID"

Any input or suggestions would be appreciated. Thanks!

Thanks @doanduyhai for suggesting the SASI secondary index; it really made a huge difference.

2
Performance will very much depend on the design of your tables and how many rows you need to read to answer one single request. If you're optimizing the path for the UI rendering, it would make sense to write the data in Cassandra in a shape that will give you the info you need with minimal reads. I don't think there's a general answer without looking into the specific requirements of the app. - maasg

2 Answers

2
votes

Will Cassandra be fast enough to give results in real time? Or should I create an RDD from Cassandra and perform interactive queries over it?

It depends on how much filtering you're doing up front and on the number of machines in your cluster. If your Cassandra table holds 1 TB of data and your query fetches 100 GB of it into memory, then on a cluster of 10 machines each node has to load about 10 GB; that is manageable, but the query will never be sub-minute.

Now, if you filter enough to fetch only 100 MB in total out of the Cassandra table, that is 10 MB per machine, and latencies on the order of seconds become possible.

How to filter data early in Cassandra?

  1. Use the new SASI secondary index (wait for Cassandra 3.5, released this week, because 2 critical bugs have been discovered); see the sketch after this list
  2. Use DSE Search to filter early with Solr
  3. Use the Stratio Lucene secondary index
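
For option 1, a rough sketch of what creating the SASI indexes could look like, issued here through the connector's session; the keyspace/table/column names are assumptions taken from the query in the question:

    import org.apache.spark.SparkConf
    import com.datastax.spark.connector.cql.CassandraConnector

    val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder

    // One-time DDL: SASI indexes allow server-side range filtering on lat/long,
    // so only rows inside the bounding box ever leave Cassandra.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(
        "CREATE CUSTOM INDEX IF NOT EXISTS geonames_lat_idx " +
        "ON performance.geonames (latitude) " +
        "USING 'org.apache.cassandra.index.sasi.SASIIndex'")
      session.execute(
        "CREATE CUSTOM INDEX IF NOT EXISTS geonames_long_idx " +
        "ON performance.geonames (longitude) " +
        "USING 'org.apache.cassandra.index.sasi.SASIIndex'")
    }

A bounding-box query restricted on those columns can then be served by the indexes instead of a full scan (CQL may still require ALLOW FILTERING when you combine restrictions on several indexed columns).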
1
votes

which will be better/faster

As always, it depends, for example on what kind of queries you want to be able to run. Cassandra imposes some limitations based on how you set up your tables; for instance, you can't use non-indexed columns in WHERE clauses. If you use Spark in between to load the data, you can perform any query that Spark allows.

If you use Spark SQL/DataFrames and perform a query that Cassandra does allow, Spark will push it down to Cassandra and you get performance similar to using Cassandra directly. Otherwise, it has to load all the data from Cassandra and then apply whatever filters you specified, which is obviously much less efficient (see the sketch below).
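
For illustration, a rough sketch of the DataFrame path (Spark 1.6-style API; keyspace, table and column names are assumptions based on the question):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("geo-adhoc")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder node
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val geonames = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "performance", "table" -> "geonames"))
      .load()

    // If 'latitude' is a clustering column (or has a suitable secondary index),
    // this range predicate can be pushed down, so Cassandra only returns matching
    // rows; otherwise the whole table is loaded and Spark filters it afterwards.
    val inBox = geonames
      .filter(geonames("latitude") >= 40.0 && geonames("latitude") <= 41.0)

    inBox.explain()   // inspect the physical plan to see which filters were pushed down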

So overall you could use Spark, but it will also let users run very slow queries. You should therefore still be aware of how you structure the data in Cassandra, to maximize the chance of predicate pushdown and get raw Cassandra performance.