Assume I have a cluster with 10 executors and code like the following:
val tokenized = sc.textFile(args(0)).flatMap(_.split(' ')).map((_, 1))
val wordCounts = tokenized.reduceByKey(_ + _)
The file is very big, bigger than the total memory of the cluster. Say I set the partition count to 100, so each partition fits in an executor's memory, and I have 100 tasks doing the load, flatMap, and map (see the sketch just below).
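For concreteness, this is how I would set that partition count, a minimal sketch using textFile's minPartitions parameter (the value 100 is just illustrative, and sc is the usual SparkContext):

// Ask Spark for at least 100 input partitions so each one fits in executor memory
val tokenized = sc.textFile(args(0), minPartitions = 100)
  .flatMap(_.split(' '))
  .map((_, 1))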
My question: where is the tokenized RDD (the intermediate result) stored, given that I never call cache()? I assume Spark would spill the partitions to disk. In that case, what is the difference between that and the following version with cache()?
val tokenized = sc.textFile(args(0)).flatMap(_.split(' ')).map((_, 1)).cache()
val wordCounts = tokenized.reduceByKey(_ + _)
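As I understand it, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY). Here is a sketch of the variant I would write if I wanted to allow spilling to disk explicitly; picking MEMORY_AND_DISK here is my own assumption about what is relevant:

import org.apache.spark.storage.StorageLevel

// Same pipeline, but explicitly allow cached partitions to spill to local disk
val tokenized = sc.textFile(args(0))
  .flatMap(_.split(' '))
  .map((_, 1))
  .persist(StorageLevel.MEMORY_AND_DISK)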
Will Spark still shuffle even if I cache the tokenized RDD?
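For what it's worth, I assume I could check for a shuffle by inspecting the lineage with toDebugString; a sketch (the collect() is only there to force evaluation):

val wordCounts = tokenized.reduceByKey(_ + _)
// Print the RDD lineage; a ShuffledRDD entry would indicate a shuffle stage
println(wordCounts.toDebugString)
// Trigger an action so the job actually runs
wordCounts.collect().take(10).foreach(println)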