According to the Spark documentation, if we don't cache a given RDD, then every time we reference it, the lineage (the graph of transformations) behind the RDD is re-evaluated. But in practice, when I tried this in the Spark shell, I see that even without caching explicitly, an in-memory copy still appears to be used. Why would Spark cache an RDD when we don't ask it to? I am running Spark in standalone mode on Windows; could that have something to do with it?
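To be clear, by "cache" I mean an explicit call like the one sketched below, which I have not made anywhere in this session (just my understanding of what explicit caching looks like; the val name is arbitrary):

// Explicit caching, which I have NOT done in my experiment
val cachedRdd = sc.textFile("sample.txt").map(line => line.split(",")).map(line => (line(0),line(1)))
cachedRdd.cache() // or cachedRdd.persist()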
Let me describe what I did. I created a simple text file, sample.txt:
key1,value1
key2,value2
key3,value3
Then, from the Spark Scala shell, I created an RDD:
val rdd = sc.textFile("sample.txt").map(line => line.split(",")).map(line => (line(0),line(1)))
When I perform the following action on this RDD, I get value1:
rdd.lookup("key1")
So far so good. Now I open the original source file and add one more entry:
key4,value4
I save the file. Then, from the same shell session (I haven't exited the shell), I run the following action:
rdd.lookup("key4")
It returns an empty list, meaning it didn't find an entry for key4. So Spark is still using an older copy of the data that it is apparently holding in memory. If the documentation is right, Spark should re-evaluate the RDD's complete lineage from scratch, in which case it would have picked up key4,value4. But it is completely unaware of the new line in the file. Why is this happening? I have clearly not cached the RDD, yet it still refers to the older version of the file.
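For comparison, my expectation was that a lookup without caching would re-read the file, along the lines of the sketch below (I haven't verified this; freshRdd is just an illustrative name):

// Rebuilding the RDD in the same shell should, as I understand it, re-read sample.txt from disk
val freshRdd = sc.textFile("sample.txt").map(line => line.split(",")).map(line => (line(0),line(1)))
freshRdd.lookup("key4") // I would expect this to return value4 if the lineage were re-evaluated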
Thanks