I am running a very simple program which counts words in S3 files:
JavaRDD<String> rdd = sparkContext.getSc().textFile("s3n://" + S3Plugin.s3Bucket + "/" + "*", 10);
JavaRDD<String> words = rdd.flatMap(s -> java.util.Arrays.asList(s.split(" ")).iterator()).persist(StorageLevel.MEMORY_AND_DISK_SER());
JavaPairRDD<String, Integer> pairs = words.mapToPair(s -> new Tuple2<String, Integer>(s, 1)).persist(StorageLevel.MEMORY_AND_DISK_SER());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b).persist(StorageLevel.MEMORY_AND_DISK_SER());
//counts.cache();
Map<String, Integer> m = counts.collectAsMap();
System.out.println(m);
After running the program multiple times, I can see multiple cache entries. This means that every time I run the process, it creates a new cache rather than reusing the existing one.
The time taken to run the script also remains the same on every run.
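To make the observation concrete, I assume a check like the sketch below would list what the current context has persisted (this is not code from my program; sparkContext.getSc() is my accessor from the snippet above, and getPersistentRDDs lives on the underlying SparkContext):

// Rough sketch: ask the current SparkContext which RDDs it has marked as persistent.
// sparkContext.getSc() is assumed to return a JavaSparkContext; .sc() unwraps the SparkContext.
System.out.println("RDDs persisted in this context: "
        + sparkContext.getSc().sc().getPersistentRDDs().size());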
Also, when I run the program, I always see progress logs like this:
[Stage 12:===================================================> (9 + 1) / 10]
My understanding was that when we cache RDDs, Spark will not redo the operations but will instead fetch the data from the cache.
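In other words, within a single run I would expect something like the following sketch (illustrative only, not my actual code) to serve the second action from the cache:

long firstStart = System.currentTimeMillis();
Map<String, Integer> first = counts.collectAsMap();   // first action: computes the lineage and fills the cache
long secondStart = System.currentTimeMillis();
Map<String, Integer> second = counts.collectAsMap();  // second action on the same persisted RDD: should hit the cache
long secondEnd = System.currentTimeMillis();
System.out.println("first: " + (secondStart - firstStart) + " ms, second: " + (secondEnd - secondStart) + " ms");

But when the whole program is run again, the new run takes just as long as the first one.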
So I need to understand why Spark doesn't use the cached RDD and instead creates a new cache entry when the process is run again.
Does Spark allow cached RDDs to be used across jobs, or are they available only within the current context?