0 votes

I am running a very simple program which counts words in S3 files:

    JavaRDD<String> rdd = sparkContext.getSc().textFile("s3n://" + S3Plugin.s3Bucket + "/" + "*", 10);

    JavaRDD<String> words = rdd.flatMap(s -> java.util.Arrays.asList(s.split(" ")).iterator()).persist(StorageLevel.MEMORY_AND_DISK_SER());
    JavaPairRDD<String, Integer> pairs = words.mapToPair(s -> new Tuple2<String, Integer>(s, 1)).persist(StorageLevel.MEMORY_AND_DISK_SER());
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b).persist(StorageLevel.MEMORY_AND_DISK_SER());
    //counts.cache();

    Map<String, Integer> m = counts.collectAsMap();

    System.out.println(m);

After running the program multiple times, I can see multiple entries in the Spark UI Storage tab:

[Screenshot: Spark UI Storage tab, showing one cached RDD entry per run]

This means that every time I run the process, it creates a new cache entry.

The time taken to run the script also remains the same on every run.

Also, when I run the program, I always see this kind of log:

[Stage 12:===================================================>     (9 + 1) / 10]

My understanding was that when we cache RDDs, Spark won't perform the operations again and will instead fetch the data from the cache.

So I need to understand why Spark doesn't use the cached RDD and instead creates a new cache entry when the process is run again.

Does Spark allow using cached RDDs across jobs, or is the cache available only within the current context?

2
RDDs are cached within the current SparkContext. When you execute the same script multiple times, even on the same data, you still create a different SparkContext each time, so your cache is no longer valid. Caching is useful only if you use the same RDD multiple times within the same SparkContext. There's nothing unexpected in this behaviour. – drstein

2 Answers

0 votes

Cached data only persists for the length of your Spark application. If you run the application again, you will not be able to make use of cached results from previous runs of the application.
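
A minimal sketch of the one case where the persisted RDD actually gets reused: a second action inside the same application. It reuses the question's sparkContext.getSc() wrapper, S3Plugin.s3Bucket and pipeline, so treat it as an illustration rather than a drop-in snippet.

    // Same pipeline as in the question, persisted once.
    JavaPairRDD<String, Integer> counts = sparkContext.getSc()
            .textFile("s3n://" + S3Plugin.s3Bucket + "/" + "*", 10)
            .flatMap(s -> java.util.Arrays.asList(s.split(" ")).iterator())
            .mapToPair(s -> new Tuple2<String, Integer>(s, 1))
            .reduceByKey((a, b) -> a + b)
            .persist(StorageLevel.MEMORY_AND_DISK_SER());

    // First action: reads the files from S3, runs the shuffle and fills the cache.
    Map<String, Integer> wordCounts = counts.collectAsMap();

    // Second action in the same application (same SparkContext): the cached
    // partitions are reused, so S3 is not read again and the earlier stages
    // are not recomputed.
    long distinctWords = counts.count();

The reuse only happens because both actions share one SparkContext. Launching the program a second time creates a new SparkContext with an empty cache, which is why the Storage tab keeps accumulating entries.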

-1 votes

The logs will show the total number of stages, but if you go to the Spark UI at localhost:4040 you can see that some tasks are skipped because of caching. So monitor your jobs with the Spark UI at localhost:4040 rather than with the console output alone.
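
As a rough illustration (assuming the same persisted counts RDD from the question), the skipped stages only show up when a second action runs on the cached RDD within the same application; the console progress bar alone will not reveal that, the UI will.

    // First action: all 10 partitions are computed from S3 and cached.
    System.out.println(counts.collectAsMap());

    // Second action on the same persisted RDD, still inside the same SparkContext:
    // in the Spark UI at localhost:4040 (Jobs / Stages tabs) the stages that
    // produced the cached partitions now appear as skipped.
    System.out.println(counts.count());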