
Everywhere on Google, the key difference between Spark and Hadoop MapReduce is stated as the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read from and write to disk. I think I understand it, but I would like to confirm with an example.

Consider this word count example:

 // split lines into words, pair each word with 1, sum counts per word
 val text = sc.textFile("mytextfile.txt")
 val counts = text.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
 counts.collect()

My understanding:

In the case of Spark, once the lines are split by " ", the output is kept in memory; the same applies to the subsequent map and reduceByKey steps, and I believe the same is true when processing happens across partitions. I could even keep an intermediate RDD in memory explicitly, as in the sketch below.
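A minimal sketch of what I mean, where cache() just makes the in-memory step explicit (mytextfile.txt is the same placeholder file as above):

 val text = sc.textFile("mytextfile.txt")
 val words = text.flatMap(line => line.split(" "))
 // keep this intermediate RDD in memory so later actions
 // reuse it instead of re-reading the file from disk
 words.cache()
 val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
 counts.collect()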

In the case of MapReduce, will each intermediate result (like the words after split/map/reduce) be kept on disk, i.e. HDFS, which makes it slower compared to Spark? Is there no way to keep them in memory? Is the same true for per-partition results?


1 Answer


Yes, you are essentially right, with one nuance: MR writes the intermediate map output to the local disk of each node, and to HDFS only between chained jobs; either way, everything round-trips through disk.

Spark's intermediate RDD (Resilient Distributed Dataset) results are kept in memory, so latency is a lot lower and job throughput is higher. RDDs have partitions, chunks of data like MR's input splits. Spark is also built for iterative processing, another key point to consider; see the sketch below.
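A minimal sketch of the iterative point, assuming a made-up points.txt file and a dummy averaging step; what matters is that data is scanned from memory on every pass, where a chain of MR jobs would round-trip through HDFS between passes:

 val data = sc.textFile("points.txt")
   .map(line => line.split(",").map(_.toDouble))
   .cache() // read from disk once, then served from memory

 var estimate = 0.0
 for (i <- 1 to 10) {
   // each pass re-scans the cached partitions in memory
   estimate = data.map(_.sum).reduce(_ + _) / data.count()
 }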

MR does have a Combiner, of course, which eases the pain a little by pre-aggregating map output before the shuffle. Spark's reduceByKey gives you the same map-side combining for free, which is one reason to prefer it over groupByKey.
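A sketch of that contrast, reusing the question's word-count data:

 val pairs = sc.textFile("mytextfile.txt")
   .flatMap(_.split(" "))
   .map(word => (word, 1))

 // combines partial counts on each partition before the shuffle,
 // like an MR Combiner, so less data crosses the network
 val combined = pairs.reduceByKey(_ + _)

 // ships every single (word, 1) pair through the shuffle first
 val uncombined = pairs.groupByKey().mapValues(_.sum)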

Spark is also far easier to use, whether through Scala or PySpark.

In general, I would not worry about MR anymore.

Here is an excellent read on Spark, by the way: https://medium.com/@goyalsaurabh66/spark-basics-rdds-stages-tasks-and-dag-8da0f52f0454