Why is Spark faster than Hadoop Map Reduce

Question

Can someone explain using the word count example, why Spark would be faster than Map Reduce?

Another possible duplicate: "Is caching the only advantage of spark over map-reduce?" (zaxliu)

zaxliu · Accepted Answer · 2015-09-15T06:06:29

bafna's answer provides the memory side of the story, but I want to add two other important factors: the DAG and the ecosystem.

Spark uses "lazy evaluation" to build a directed acyclic graph (DAG) of consecutive computation stages before running anything. Because the whole plan is known up front, it can be optimized, e.g. to minimize shuffling data around. In MapReduce, by contrast, this has to be done manually by tuning each MR step. (This point is easier to follow if you are familiar with execution plan optimization in an RDBMS or the DAG-style execution of Apache Tez.)
The Spark ecosystem offers a versatile stack of components for SQL, ML, streaming, and graph mining tasks. In the Hadoop ecosystem you have to install separate packages for each of these.
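
To make the lazy-evaluation point concrete for word count, here is a minimal toy sketch in plain Python. It is not Spark's API: the class `LazyPipeline` and its methods are hypothetical names made up for illustration. It only shows the idea that transformations are recorded as a plan and that nothing runs until an action is called, at which point each record flows through all recorded steps in a single pass.

```python
from collections import Counter
from itertools import chain


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's RDD)."""

    def __init__(self, source, steps=()):
        self.source = source        # any iterable of input records
        self.steps = tuple(steps)   # recorded (kind, function) pairs: the plan

    def flat_map(self, fn):
        # Transformation: extend the plan, do no work yet.
        return LazyPipeline(self.source, self.steps + (("flat_map", fn),))

    def filter(self, fn):
        # Transformation: extend the plan, do no work yet.
        return LazyPipeline(self.source, self.steps + (("filter", fn),))

    def _run(self):
        # Chain the recorded steps as lazy iterators, so each record passes
        # through every step without building intermediate collections.
        records = iter(self.source)
        for kind, fn in self.steps:
            if kind == "flat_map":
                records = chain.from_iterable(map(fn, records))
            else:
                records = filter(fn, records)
        return records

    def count_by_value(self):
        # Action: only now is the plan executed.
        return Counter(self._run())


calls = []  # records when the splitting function actually runs


def split_words(line):
    calls.append(line)
    return line.lower().split()


lines = ["Spark is fast", "MapReduce is slower", "spark is lazy"]
plan = LazyPipeline(lines).flat_map(split_words).filter(lambda w: len(w) > 2)
calls_before_action = len(calls)   # still 0: the plan has only been recorded

counts = plan.count_by_value()     # the action triggers the single pass
```

In a MapReduce job, each comparable step would typically be a separately written and tuned phase; here the plan is assembled first and executed as a whole, which is the property the optimizer exploits.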

I also want to add that, even if your data is too big for main memory, you can still use Spark by choosing to persist your data on disk. Doing so gives up the advantages of in-memory processing, but you still benefit from the DAG execution optimization.
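
As a rough illustration of what persisting on disk buys you, here is a toy sketch in plain Python. The class `DiskPersisted` is a hypothetical name, not part of Spark; in Spark itself the corresponding call would be along the lines of `rdd.persist(StorageLevel.DISK_ONLY)`. The sketch computes a result once, writes it to a temporary file, and serves later actions from that file instead of recomputing or holding the data in memory.

```python
import os
import pickle
import tempfile


class DiskPersisted:
    """Toy model of a disk-persisted dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self.compute = compute   # zero-argument function producing a list of records
        self.path = None         # set once the result has been written to disk

    def records(self):
        if self.path is None:
            # First action: run the computation and write the result to a temp file.
            fd, self.path = tempfile.mkstemp(suffix=".pkl")
            with os.fdopen(fd, "wb") as f:
                pickle.dump(self.compute(), f)
        # Every action reads the records back from disk.
        with open(self.path, "rb") as f:
            return pickle.load(f)

    def unpersist(self):
        # Remove the on-disk copy when it is no longer needed.
        if self.path is not None:
            os.remove(self.path)
            self.path = None


runs = []  # records how many times the computation actually executes


def tokenize():
    runs.append(1)
    return "spark is fast and spark is lazy".split()


words = DiskPersisted(tokenize)
total = len(words.records())           # first action: computes and writes to disk
distinct = len(set(words.records()))   # second action: reads back, no recomputation
words.unpersist()
```

Reading from disk is slower than reading from memory, but the expensive upstream computation still runs only once.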

Some informative answers on Quora: here and here.
