
I run a Spark application that uses StorageLevel.OFF_HEAP to persist an RDD (my Tachyon and Spark are both in local mode).

like this:

  val lines = sc.textFile("FILE_PATH/test-lines-1")
  val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).persist(StorageLevel.OFF_HEAP)
  val counts = words.reduceByKey(_ + _)
  counts.collect.foreach(println)
  ...
  sc.stop

When the persist is done, I can see my OFF_HEAP files at localhost:19999 (Tachyon's web UI); this is what I expected.

But after the Spark application is over (sc.stop, while Tachyon keeps running), my blocks (the OFF_HEAP RDD) are removed, and I can no longer find my files at localhost:19999. This is not what I want. I think these files belong to Tachyon (not Spark) after the persist() call, so they should not be removed.

So, who deleted my files, and when? Is this the normal behavior?


1 Answer


You are looking for

  saveAs[Text|Parquet|NewHadoopAPI]File()

That is the truly "persistent" method you need.

By contrast,

  persist()

is used for intermediate storage of RDDs: when the Spark process ends, they are removed. Here is the relevant comment from the source code:

  • Set this RDD's storage level to persist its values across operations after the first time it is computed.

The important phrase is "across operations": the persisted data is kept only as part of processing within the application, not after it ends.
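To make the difference concrete, here is a minimal sketch: persist(OFF_HEAP) caches the counts for reuse within the application, while saveAsTextFile writes files that survive sc.stop. The tachyon:// URI and the output path are illustrative assumptions, not taken from the question; adjust them to your Tachyon master address.

```scala
// Sketch: persist() vs. saveAsTextFile() (assumes Spark and Tachyon in local mode).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistVsSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-vs-save").setMaster("local[*]"))

    val lines  = sc.textFile("FILE_PATH/test-lines-1")
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // Intermediate storage only: these off-heap blocks are dropped
    // when the SparkContext stops.
    counts.persist(StorageLevel.OFF_HEAP)
    counts.collect.foreach(println)

    // Real persistence: writes files through the Tachyon filesystem
    // that remain visible after the application exits.
    // (hypothetical path; 19998 is Tachyon's default master port)
    counts.saveAsTextFile("tachyon://localhost:19998/result/word-counts")

    sc.stop()
  }
}
```

After this runs, the saved files stay listed in Tachyon's web UI even though the cached OFF_HEAP blocks disappear with the application.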