3
votes

I'm using Spark 1.2.0 and haven't configured SPARK_LOCAL_DIRS explicitly, so I'm assuming that persisted RDDs would go to /tmp. I'm trying to persist an RDD using the following code:

    import org.apache.spark.storage.StorageLevel

    val inputRDD = sc.parallelize(List(1,2,3,3,4,5,6,7,8,9,19,22,21,25,34,56,4,32,56,70))
    val result = inputRDD.map(x => x*x)
    println("Result count is: " + result.count())
    result.persist(StorageLevel.DISK_ONLY)
    println(result.collect().mkString(",,"))
    println("Result count is: " + result.count())

I force a count() on my RDD before and after persist just to be sure, but I still don't see any new files or directories in /tmp. The only directory that changes when I run my code is hsperfdata...., which I know is for JVM perf data.

Where are my persisted RDDs going?

1
What's your cluster configuration? - eliasah
I haven't configured a cluster per se. I'm using IntelliJ for Scala and have just linked the Spark libraries to my project. I'm still learning, so I haven't gotten around to configuring the spark-env.sh file yet. - Jimit Raithatha
Start reading the official documentation! I believe you're missing some basic concepts. - eliasah

1 Answer

0
votes

From the Scaladoc of RDD.persist():

Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. Local checkpointing is an exception.

So you've called result.count() on the line above result.persist(); by that point Spark has already computed the RDD with the default storage level, and the later persist(StorageLevel.DISK_ONLY) call only marks the RDD for future computations. Remove that first count() (or call persist() before it) and try again.
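
A minimal sketch of the corrected ordering (the app name and the spark.local.dir path are illustrative, not taken from the question): mark the RDD with persist(StorageLevel.DISK_ONLY) before the first action, so that the blocks are written under spark.local.dir (default /tmp) when count() runs, and served from disk on the next action.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("persist-demo") // illustrative name
      .set("spark.local.dir", "/tmp/spark-scratch") // hypothetical dir, just to make the block files easy to find
    val sc = new SparkContext(conf)

    val inputRDD = sc.parallelize(List(1, 2, 3, 4, 5))
    val result = inputRDD.map(x => x * x)

    result.persist(StorageLevel.DISK_ONLY)         // only marks the RDD; nothing is written yet
    println("Result count is: " + result.count())  // first action: computes the RDD and spills its blocks to disk
    println(result.collect().mkString(","))        // second action: reads the persisted blocks back from disk

After the first action completes, the configured spark.local.dir should contain newly created Spark working directories holding the persisted blocks.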