4
votes

I'm setting up a Hadoop cluster for testing/PoC purposes. Is there anything today that can't be done with Spark as the processing engine? It looks to me like Spark has replaced MR, and all the other tools/abstractions built around Hadoop are compatible with Spark too (Hive, Sqoop, Flume, HBase, Pig...) - or are there limitations otherwise? As I understand it, Spark is at least as fast as MR even for batch jobs, and if you want to expand your use cases in the future (streaming), you'd need to adopt Spark anyway.

I'm asking because, as of today, most introductions and tutorials still teach you about Hadoop and MapReduce.


1 Answer

4
votes

Quite simply, no, there is no reason to use MapReduce these days. Spark is faster, easier to work with, and has more ongoing development, more connectors, more users, more optimisations...
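
To give a sense of what "easier to work with" means, a word count - the canonical MapReduce example - is only a handful of lines in Spark. Here is a minimal PySpark sketch, assuming a local Spark installation and a hypothetical input path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    counts = (
        spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
        .flatMap(lambda line: line.split())   # emit one token per word
        .map(lambda word: (word, 1))          # key each word with a count of 1
        .reduceByKey(lambda a, b: a + b)      # sum the counts per word
    )

    for word, count in counts.collect():
        print(word, count)

The equivalent MapReduce job in Java is typically a mapper class, a reducer class, and a driver class spread across far more boilerplate.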

MapReduce shows up in tutorials because many tutorials are outdated, but also because MapReduce demonstrates the underlying model by which data is processed in all distributed systems. In my opinion, anyone wanting to work with "big data" should understand MapReduce, at least conceptually - see the toy sketch below.
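
By "conceptually" I mean the three steps the model boils down to: a map step that emits (key, value) pairs, a shuffle that groups them by key, and a reduce step that combines the values for each key. This is a toy illustration of that idea in plain Python, with no framework involved (the function names are just for illustration):

    from collections import defaultdict

    def map_phase(lines):
        for line in lines:
            for word in line.split():
                yield (word, 1)            # emit (word, 1) for every token

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)      # group values by key
        return groups

    def reduce_phase(groups):
        return {key: sum(values) for key, values in groups.items()}

    lines = ["spark replaces mapreduce", "mapreduce is a model"]
    print(reduce_phase(shuffle(map_phase(lines))))
    # {'spark': 1, 'replaces': 1, 'mapreduce': 2, 'is': 1, 'a': 1, 'model': 1}

Spark, Hadoop MapReduce, and most other distributed engines are, at their core, ways of running those same phases in parallel across a cluster.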