0
votes

I'm aggregating my data from Cassandra using Spark and the Spark-Cassandra connector. I have a web application for it with a single shared SparkContext and a REST API. Processing has the following flow:

  1. Read Cassandra table
  2. Prepare it for filtering (sequence of Spark transformations)
  3. Filter prepared RDD according to api call parameters

In the algorithm above, only the third step differs between calls (it depends on the API request parameters). API requests execute in parallel (one thread per request). As the data in the table isn't very dynamic and I have enough memory on my Spark workers to store the whole table, I want to persist my RDD after the second step and, on every request, just filter the already persisted RDD. I also want to periodically update this RDD. What is the best way to achieve this?
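For reference, a stripped-down sketch of what steps 1 and 2 look like on my side (the keyspace/table names, the Row case class and the prepare transformations below are just placeholders, not my real code):

import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Placeholder row type; the real schema is different
case class Row(id: String, category: String, value: Double)

// Step 1: read the Cassandra table (placeholder keyspace/table names)
def loadTable(sc: SparkContext): RDD[Row] =
  sc.cassandraTable[Row]("my_keyspace", "my_table")

// Step 2: prepare the data for filtering (placeholder transformations)
def prepare(raw: RDD[Row]): RDD[Row] =
  raw.filter(_.value >= 0).keyBy(_.id).values

// I want to persist the result of step 2 and reuse it for every request (step 3)
val prepared: RDD[Row] = prepare(loadTable(sc)).persist(StorageLevel.MEMORY_ONLY)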

1
I am not sure I understand correctly what you want. Does your RDD need to be shared between different contexts? Otherwise a simple persist should do the trick. – Jonathan Taws
@Hawknight no, I have only one context. Can I just store my persisted table inside a Scala object? – Cortwave
Well, if your table is converted to an RDD, you can simply persist the RDD and keep a reference to the persisted RDD (technically only persisted once an action is called) for your subsequent calls. – Jonathan Taws
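Following up on the comments, a minimal sketch of keeping the persisted RDD in a Scala object and filtering it per request (the SharedData object and the filter predicate are illustrative; Row and prepared are from the sketch in the question):

import org.apache.spark.rdd.RDD

// Illustrative singleton that holds the reference to the persisted RDD
object SharedData {
  @volatile var prepared: RDD[Row] = _
}

// Set once, after step 2 has been prepared and persisted
SharedData.prepared = prepared

// Step 3, executed per REST request (thread per request)
def handleRequest(category: String): Array[Row] =
  SharedData.prepared.filter(_.category == category).collect()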

1 Answer

1
votes

You can just call persist on the RDD after step 2. The RDD will be computed and cached when the first action is called. When you need to refresh the data, call unpersist on the old RDD and persist a freshly loaded one; Spark drops the old cache and builds the new one the next time an action runs. Basically, you will do something like this:

var data = loadAndFilter()
while (!stop) {
  // Cache the prepared RDD; it is materialized on the first action
  data.persist()
  // Do step 3 (serve the filtering requests against the cached RDD)

  // Drop the old cache without blocking
  data.unpersist(blocking = false)
  // Load the fresh data
  data = loadAndFilter()
}
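Since the requests run in parallel against the shared RDD, one variation of the refresh (a sketch, reusing the hypothetical SharedData holder and the loadTable/prepare placeholders from the question) is to materialize the new cache first, swap the reference, and only then drop the old cache, so in-flight requests keep reading the old RDD:

def refresh(sc: SparkContext): Unit = {
  val fresh = prepare(loadTable(sc)).persist()
  fresh.count()                                     // force the new cache to materialize
  val old = SharedData.prepared
  SharedData.prepared = fresh                       // new requests now see the fresh data
  if (old != null) old.unpersist(blocking = false)  // drop the old cache without waiting
}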