Unexpected Spark caching behavior

Question

I have a Spark program that essentially does this:

def foo(a: RDD[...], b: RDD[...]) = {
  val c = a.map(...)
  c.persist(StorageLevel.MEMORY_ONLY_SER)
  var current = b
  for (_ <- 1 to 10) {
    val next = some_other_rdd_ops(c, current)
    next.persist(StorageLevel.MEMORY_ONLY)
    current.unpersist()
    current = next
  }
  current.saveAsTextFile(...)
}

The strange behavior I'm seeing is that the Spark stages corresponding to val c = a.map(...) run 10 times. I expected them to run only once because of the persist call on the next line, but that's not the case. When I look at the "Storage" tab of the running job, very few of the partitions of c are cached.

Also, 10 copies of that stage immediately show as "active". 10 copies of the stage corresponding to val next = some_other_rdd_ops(c, current) show up as pending, and they roughly alternate execution.

Am I misunderstanding how to get Spark to cache RDDs?

Edit: here is a gist containing a program to reproduce this: https://gist.github.com/jfkelley/f407c7750a086cdb059c. It expects as input the edge list of a graph (with edge weights). For example:

a   b   1000.0
a   c   1000.0
b   c   1000.0
d   e   1000.0
d   f   1000.0
e   f   1000.0
g   h   1000.0
h   i   1000.0
g   i   1000.0
d   g   400.0

Lines 31-42 of the gist correspond to the simplified version above. I get 10 stages corresponding to line 31 when I would only expect 1.

I think your expectation is right. Maybe there is something fishy with the code? Could you provide an example with which we can reproduce the problem? One possible explanation would be that as you keep putting stuff in the cache, it pushes out c. I'm not sure that's the case though. — Daniel Darabos
Daniel's guess that the cache is getting evicted is valid. Also, some_other_rdd_ops is a black box to us, so it could be doing something unexpected. — Justin Pihony
I would look more into the current.unpersist() statement you have. Are you sure that c never becomes current? — marios
@marios, yes, I am sure. c and current have different types anyway. @JustinPihony, some_other_rdd_ops is: c.join(current.map(...)).aggregateByKey(...).mapValues(...). No persist/unpersist, collect, saveAsTextFile, etc. — Joe K
@DanielDarabos Sure, I added a fully-executable example to reproduce this. Sorry it's a bit more complicated; that's why I originally posted the simplified version. — Joe K

Michael Mior · Accepted Answer · 2018-03-07T17:29:12

The problem here is that calling cache (or persist) is lazy. Nothing will be cached until an action is triggered and the RDD is evaluated. All the call does is set a flag on the RDD to indicate that it should be cached when evaluated.

Unpersist, however, takes effect immediately. It clears the flag indicating that the RDD should be cached and also begins purging the data from the cache. Since you only have a single action at the end of your application, by the time any of the RDDs are evaluated, the intermediate RDDs have already been unpersisted, so Spark does not see that they should be cached!
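To make the lazy/eager difference concrete, here is a minimal sketch that can be run against a local master. It assumes Spark 2.0 or later (for longAccumulator); the names LazyPersistDemo, evaluations, and doubled are illustrative only and do not come from the question. It counts how often the map function runs, so it should show that persist alone computes nothing and that, once unpersist has cleared the flag, every action recomputes the RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LazyPersistDemo {
  /** Returns (map calls seen before any action, map calls seen after two actions). */
  def run(sc: SparkContext): (Long, Long) = {
    // Accumulator used only to observe how many times the map function executes.
    val evaluations = sc.longAccumulator("evaluations")
    val doubled = sc.parallelize(1 to 100, 4).map { x => evaluations.add(1); x * 2 }

    doubled.persist(StorageLevel.MEMORY_ONLY)
    // persist only sets the storage level; no job has run yet.
    assert(doubled.getStorageLevel == StorageLevel.MEMORY_ONLY)
    val beforeAction = evaluations.sum

    doubled.unpersist()
    // unpersist resets the storage level right away, before any action.
    assert(doubled.getStorageLevel == StorageLevel.NONE)

    // Two actions on an RDD that is no longer marked for caching:
    // each one re-runs the map over all 100 elements.
    doubled.count()
    doubled.count()
    (beforeAction, evaluations.sum)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-persist-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // In a local run without task retries this should print (0,200).
      println(run(sc))
    } finally {
      sc.stop()
    }
  }
}
```

If the unpersist call is removed, the first count() would populate the cache and the second would read from it, so the map function should run only about 100 times in total.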

I agree that this is surprising behaviour. The way that some Spark libraries (including the PageRank implementation in GraphX) work around this is by explicitly materializing each RDD between the calls to cache and unpersist. For example, in your case you could do the following:

def foo(a: RDD[...], b: RDD[...]) = {
  val c = a.map(...)
  c.persist(StorageLevel.MEMORY_ONLY_SER)
  var current = b
  for (_ <- 1 to 10) {
    val next = some_other_rdd_ops(c, current)
    next.persist(StorageLevel.MEMORY_ONLY)
    next.foreachPartition(x => {}) // materialize before unpersisting
    current.unpersist()
    current = next
  }
  current.saveAsTextFile(...)
}
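Since the snippet above elides the actual RDD operations, here is a self-contained sketch of the same pattern that can be run locally. The iterate helper and the IterateWithCache object are hypothetical names introduced for illustration, and the toy step (adding 1 to every element) merely stands in for some_other_rdd_ops.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object IterateWithCache {
  // Hypothetical helper: applies `step` `steps` times, materializing each
  // result in the cache before releasing the previous one.
  def iterate[T](initial: RDD[T], steps: Int)(step: RDD[T] => RDD[T]): RDD[T] = {
    var current = initial
    for (_ <- 1 to steps) {
      val next = step(current).persist(StorageLevel.MEMORY_ONLY)
      // An action with an empty body: forces `next` to be evaluated now,
      // while it is still marked for caching.
      next.foreachPartition(_ => ())
      // As in the original loop, this also unpersists `initial` on the first pass.
      current.unpersist()
      current = next
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iterate-with-cache").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val start = sc.parallelize(Seq(1L, 2L, 3L))
      // Toy step standing in for some_other_rdd_ops.
      val result = iterate(start, 10)(_.map(_ + 1))
      println(result.collect().sorted.mkString(", ")) // prints "11, 12, 13"
    } finally {
      sc.stop()
    }
  }
}
```

Note that MEMORY_ONLY keeps only the partitions that fit in memory, so under memory pressure some partitions may still be recomputed even with this pattern.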
