3 votes

As per the Spark documentation, if we don't 'cache' a given RDD, then every time we reference it, the business logic (the graph) behind the RDD gets evaluated again. But in practice, when I tried this in the Spark shell, I see that even if I don't cache explicitly, an "in-memory" copy still seems to be used. Why would Spark cache an RDD when I don't ask it to? I am using Spark in standalone mode on Windows; does it have something to do with that?
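
To be clear, by caching explicitly I mean something like the following (just a minimal sketch of what I am not doing; the variable and file names are only examples):

val cachedRdd = sc.textFile("sample.txt")   // build the lineage, nothing is read yet
cachedRdd.cache()                           // ask Spark to keep the partitions in memory
cachedRdd.count()                           // first action evaluates the lineage and populates the cache
cachedRdd.count()                           // later actions can reuse the in-memory copy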

So let me describe what I did. I created a simple text file, sample.txt, with the following contents:

key1,value1
key2,value2
key3,value3

Now, from the Scala shell of Spark, I created an RDD as follows:

val rdd = sc.textFile("sample.txt").map(line => line.split(",")).map(line => (line(0),line(1)))

Now when I perform the following action on this RDD, I get value1:

rdd.lookup("key1")

So far it's all fine. Now I open the original source file and add one more entry to it:

key4,value4

I save the file. Then, from the same shell (I haven't exited it yet), I fire the following action:

rdd.lookup("key4")

It returns an empty list, so basically it's saying it didn't find an entry for key4. That means Spark is still using the older copy, which it's obviously holding in memory. Otherwise, if what the documentation says is right, it should evaluate the complete business logic of the RDD from scratch, and in that case it would have picked up key4,value4. But it's totally unaware of the new line in the file. Why is this happening? I have obviously not cached the RDD, yet it's still referring to the older version of the file.
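
For what it's worth, I assumed the un-cached rdd would behave just like a freshly defined one, i.e. something like this (only a sketch of my expectation; the variable name is arbitrary):

val freshRdd = sc.textFile("sample.txt").map(line => line.split(",")).map(line => (line(0),line(1)))
freshRdd.lookup("key4")   // my expectation was that re-evaluating the lineage like this would return value4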

Thanks

What version of Spark did you use? - Daniel Darabos
Huh, strange! Did you experiment with a local file, or HDFS? I could not reproduce this with 1.4.0 on local disk. If you can, I suggest you look at the DAG visualization on the Spark UI for the job. It should tell you whether it thinks it's re-reading the file. - Daniel Darabos
I am using a local file only. In fact I have this installed on my Windows laptop. Let me try checking the UI as you suggested. - Dhiraj
I tried checking the DAG visualization on the Spark UI for this job, i.e. rdd.lookup("key4"), and it shows me it's reading from the file. But the result is still that it can't fetch key4,value4 even though it was added to the file after the RDD was created. Another funny thing I noticed is that even if I cache a certain RDD and call an 'action' on it twice in a row in the shell, the DAG visualization for the latest job still shows it reading from the source file. Does that mean 'caching' doesn't work locally? - Dhiraj
Perhaps it means I just misunderstand the visualization :). Sorry, I cannot dig into this more at the moment. As a workaround I suppose you could always start from scratch with a new RDD. If there is no caching, there is no performance benefit from reusing the one RDD, I think. - Daniel Darabos

3 Answers

3 votes

I can reproduce this behavior with Apache Spark 1.3.0. I wanted to reproduce it with 1.4.0 as well, since it has very good visibility into what transformations happen in a stage. But in Spark 1.4.0 rdd.lookup("key4") works!

I think this means the behavior was caused by a bug. I couldn't find the bug number.

0 votes

Are you sure you edited and uploaded the new text file to HDFS? I repeated your steps: uploaded the file to HDFS, computed the RDD, deleted the old file, uploaded the new one with the extra line, and ran the lookup operation - it returned the new result.
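
Roughly the sequence I ran (the HDFS path and file names below are just examples):

// hdfs dfs -put sample.txt /tmp/sample.txt        <- upload the original file first
val rdd = sc.textFile("/tmp/sample.txt").map(line => line.split(",")).map(line => (line(0),line(1)))
rdd.lookup("key1")                                 // returns value1
// hdfs dfs -rm /tmp/sample.txt                    <- delete the old file
// hdfs dfs -put sample_new.txt /tmp/sample.txt    <- upload the new file that contains key4,value4
rdd.lookup("key4")                                 // this returned the new value for me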

0 votes

This is not a bug but a feature provided by the Spark shell. I was able to see the same behavior with the latest Spark-1.5.0-SNAPSHOT.

The Spark developers created the shell as an interactive console for doing fast computations on a preloaded dataset. Under the hood it uses the Scala REPL, which keeps objects in the JVM once they are declared.
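
A tiny illustration of what that means in practice (just a sketch; the variable name is arbitrary):

val rdd = sc.textFile("sample.txt")   // declared once in the shell...
rdd.count()                           // ...the REPL keeps the rdd object alive in the JVM,
rdd.first()                           // so every later statement in the session can keep referring to it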

See section 4 (Interpreter Integration) of http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf