2 votes

Assume I have a cluster with 10 executors and code like the following:

val tokenized = sc.textFile(args(0)).flatMap(_.split(' ')).map((_, 1))
val wordCounts = tokenized.reduceByKey(_ + _)

The file is very big, bigger than the total memory of the cluster. Say I set the number of partitions to 100, so each partition fits in an executor, and I have 100 tasks doing the load and the flatMap. My question is: where is the tokenized RDD stored, i.e. the intermediate result for each partition, given that I did not use cache()? I assume Spark would spill the partitions to disk. In that case, what is the difference between that and the code with cache()?

val tokenized = sc.textFile(args(0)).flatMap(_.split(' ')).map((_, 1)).cache() 
val wordCounts = tokenized.reduceByKey(_ + _)

Will Spark still shuffle even though I cached the tokenized RDD?


1 Answer

1 vote

I assume Spark would spill the partitions to disk.

That's correct. When data doesn't fit in memory, Spark spills it to the executors' local file systems, and the map-side output of the shuffle is written to local disk in any case. The tokenized RDD itself is never materialized as a whole; each task just streams its partition through flatMap and map.
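A minimal sketch of that point, assuming the input path still comes from args(0) and a SparkContext sc already exists as in your code:

// Ask for ~100 input partitions so each one fits in an executor.
val tokenized = sc.textFile(args(0), 100).flatMap(_.split(' ')).map((_, 1))

// At this point tokenized is only a lineage, not stored data: each later
// task streams one partition through flatMap/map and writes its map-side
// shuffle output to the executor's local disk.
println(tokenized.toDebugString)    // prints the lineage, not materialized data
println(tokenized.getNumPartitions) // roughly 100, depending on the input splits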

what is the difference (...) with cache?

Not much. Caching doesn't affect the first execution. It only asks Spark to keep the computed partitions around so they can be reused by later actions.
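A small sketch of what cache() buys you on a second action (the extra count() is only there for illustration):

val tokenized = sc.textFile(args(0), 100).flatMap(_.split(' ')).map((_, 1)).cache()
val wordCounts = tokenized.reduceByKey(_ + _)

wordCounts.count()  // first action: reads the file, runs flatMap/map, fills the cache as a side effect
tokenized.count()   // second action: served from the cached partitions; any evicted ones are recomputed

// If the data is too big for memory, persist() lets cached partitions spill
// to disk instead of being dropped:
// import org.apache.spark.storage.StorageLevel
// tokenized.persist(StorageLevel.MEMORY_AND_DISK)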

Will Spark still shuffle even though I cached the tokenized RDD?

Yes, it will. Caching doesn't replace shuffling, although caching can be redundant here: the shuffle already writes its intermediate files to local disk, and Spark can reuse them if the downstream stages are run again.
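For example, re-running an action on wordCounts, even without cache(), typically shows the map stage as "skipped" in the Spark UI because the shuffle files written by the first run are reused:

val tokenized = sc.textFile(args(0), 100).flatMap(_.split(' ')).map((_, 1))
val wordCounts = tokenized.reduceByKey(_ + _)

wordCounts.count()  // runs both stages: map + shuffle write, then the reduce
wordCounts.count()  // map stage is skipped: the existing shuffle files on local disk are reused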