I'm aggregating my data from Cassandra using Spark and the Spark-Cassandra connector. I have a web application for it with a single shared SparkContext and a REST API. Processing follows this flow:
- Read Cassandra table
- Prepare it for filtering (a sequence of Spark transformations)
- Filter the prepared RDD according to the API call parameters
In the algorithm above, only the third step differs per call (it depends on the API request parameters). API requests execute in parallel (one thread per request). Since the data in the table isn't very dynamic and I have enough memory on my Spark workers to hold the whole table, I want to persist the RDD after the second step and, on every request, just filter the already persisted RDD. I also want to refresh this RDD periodically. What is the best way to achieve this?
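A minimal sketch of the pattern I have in mind, assuming the connector's `sc.cassandraTable` API; the keyspace/table names (`ks`, `events`), the `value` column, and the `Params` case class are all placeholders:

```scala
import java.util.concurrent.atomic.AtomicReference

import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical request parameters coming from the REST layer.
case class Params(minValue: Int)

// Shared by all request threads; holds the currently persisted, prepared RDD.
class PreparedCache(sc: SparkContext) {

  private val current = new AtomicReference[RDD[CassandraRow]](load())

  // Steps 1-2: read the table, run the preparation transformations,
  // pin the result in worker memory, and materialize it eagerly.
  private def load(): RDD[CassandraRow] = {
    val prepared: RDD[CassandraRow] =
      sc.cassandraTable("ks", "events") // hypothetical keyspace/table
      // .map(...).filter(...)          // the preparation transformations go here
    prepared.persist(StorageLevel.MEMORY_ONLY)
    prepared.count() // force computation so requests don't pay for it
    prepared
  }

  // Step 3, per request: filter the already-persisted RDD.
  def query(p: Params): Array[CassandraRow] =
    current.get().filter(_.getInt("value") >= p.minValue).collect()

  // Call periodically (e.g. from a scheduled task) to pick up new data.
  def refresh(): Unit = {
    val old = current.getAndSet(load())
    old.unpersist(blocking = false) // in-flight requests keep their old reference
  }
}
```

The `AtomicReference` swap lets requests that are mid-filter keep using the old RDD while new requests pick up the refreshed one.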
`persist` should do the trick. – Jonathan Taws
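For reference, a minimal sketch of that suggestion (`cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`); the RDD here is a stand-in for the prepared one:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def pinPrepared(sc: SparkContext): Unit = {
  val prepared = sc.parallelize(1 to 1000).map(_ * 2) // stand-in for the prepared RDD
  prepared.persist(StorageLevel.MEMORY_ONLY_SER)      // _SER trades CPU for heap space
  prepared.count()                                    // materialize before serving requests
  // serve request filters from `prepared`; call prepared.unpersist() before rebuilding
}
```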