I have the following Spark job, in which I am trying to keep everything in memory:
import org.apache.spark.storage.StorageLevel
import scala.collection.mutable.ListBuffer

val myOutRDD = myInRDD.flatMap { fp =>
  val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
  // ... (build tuple2List here)
  tuple2List
}.persist(StorageLevel.MEMORY_ONLY)
 .reduceByKey { (p1, p2) =>
   myMergeFunction(p1, p2)
 }
 .persist(StorageLevel.MEMORY_ONLY)
However, when I looked into the job tracker, I still see a lot of shuffle write and shuffle spill to disk:
Total task time across all tasks: 49.1 h
Input Size / Records: 21.6 GB / 102123058
Shuffle write: 532.9 GB / 182440290
Shuffle spill (memory): 370.7 GB
Shuffle spill (disk): 15.4 GB
The job then failed with "no space left on device".
I am wondering: for the 532.9 GB of shuffle write here, is it written to disk or to memory?
Also, why is 15.4 GB of data still spilled to disk when I specifically asked to keep it in memory?
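Just to rule out a mistake on my side, I also print the storage level at runtime (getStorageLevel is a standard RDD method, so this only confirms what I requested above):

  // Sanity check inside the same job: should report the MEMORY_ONLY level,
  // i.e. "Memory Deserialized 1x Replicated".
  println(myOutRDD.getStorageLevel.description)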
Thanks!