Everywhere on Google, the key difference between Spark and Hadoop MapReduce is stated as the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read from and write to disk. It looks like I've got it, but I would like to confirm it with an example.
Consider this word count example:
val text = sc.textFile("mytextfile.txt")            // RDD of lines
val counts = text.flatMap(line => line.split(" "))  // one record per word
                 .map(word => (word, 1))            // pair each word with a 1
                 .reduceByKey(_ + _)                // sum the 1s per word (shuffle)
counts.collect                                      // action: runs the job, returns the counts to the driver
My understanding:
In the case of Spark, once the lines are split on " ", the output is kept in memory; the same is true for the map and reduceByKey steps. I believe this also holds when the processing happens across partitions.
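For example, here is how I picture keeping an intermediate result in memory explicitly (a minimal, untested sketch, assuming the same sc and mytextfile.txt as above; pairs is just my name for the intermediate RDD):

val text = sc.textFile("mytextfile.txt")
val pairs = text.flatMap(line => line.split(" ")).map(word => (word, 1))
pairs.cache()                     // ask Spark to keep this RDD in executor memory once computed
val counts = pairs.reduceByKey(_ + _)
counts.collect                    // first action: computes pairs and caches its partitions
pairs.count                       // later action: reuses the cached partitions instead of re-reading the file

If my understanding is right, the second action over pairs never has to re-read the file from disk.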
In the case of MapReduce, will each intermediate result (such as the words after the split/map/reduce steps) be kept on disk, i.e. HDFS, which makes it slower compared to Spark? Is there no way to keep them in memory? And is the same true for per-partition results?
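For reference, here is my sketch of the same word count as a classic Hadoop MapReduce job, written from Scala against Hadoop's mapreduce API (TokenMapper, SumReducer, WordCount and the counts output path are names I made up, and I haven't run this):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Mapper: emits (word, 1) for every word in the input line.
class TokenMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    value.toString.split(" ").foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reducer: sums the counts grouped under each word by the shuffle.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

// Driver: wires the two stages together and runs the job.
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path("mytextfile.txt"))
    FileOutputFormat.setOutputPath(job, new Path("counts"))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

My reading of the docs is that each map task's output is written to the mapper's local disk before the shuffle, and only the final reduce output lands on HDFS, but that is exactly the part I would like someone to confirm.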