Is map reduce applicable to unstructured sources of small size?

Question

I've done some research and I'm struggling to figure out the typical use case for Hadoop. What I've understood so far is that it should be the best approach for batch processing when the data size is on the order of terabytes or more, the sources are unstructured, and the algorithm is sequential, like counting the occurrences of words in many documents... At a high level, my understanding is that the key point is to move the code to the data nodes, instead of the opposite, traditional approach.

But

1) I still fail to see, in simple terms, why other classical parallel programming implementations should not reach similar performance, and

2) I wonder whether the Hadoop MapReduce paradigm could be applied to use cases in which the data size is smaller (even though the sources are still unstructured), or what the more appropriate technology would be in that case?

Paul Back · Accepted Answer · 2017-03-16T03:10:33

Your questions are very valid. I've worked pretty deeply with MapReduce and other parallel frameworks in the Big Data ecosystem, so hopefully I can provide some context. For the purposes of this conversation, let's define Hadoop as an environment consisting of HDFS and MapReduce (forget Hive, Pig, etc.).

1) Other parallel programming frameworks can achieve (and exceed) Hadoop's performance. Hadoop's advantages over most other models are fault-tolerance and the fact that a lot of the low-level details are abstracted away from the application developer, so you don't need to be an expert systems programmer to get work done at multi-petabyte scale.

2) MapReduce will function at basically any scale (see Apache's word count example, for instance; it's tiny). That being said, it has some pretty substantial overhead from operations like figuring out where to write the files and chunking up the work on the compute side (all handled for you by Hadoop). At small scale, you'd be better off processing the data with traditional map() and reduce() functions. The concept is exactly the same; only the means of execution differs.
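To make the small-scale case concrete, here is a minimal sketch of the same word count done in a single process with Python's built-in map() and functools.reduce(). The `documents` list and the helper names `map_doc` and `merge_counts` are made up for illustration; treat this as one possible way to do it, not as what Hadoop does internally.

```python
from collections import Counter
from functools import reduce

# Hypothetical small input; in practice these could be the contents of a few local files.
documents = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps over the lazy dog",
]

def map_doc(text):
    """Map step: turn one document into its own word counts."""
    return Counter(text.lower().split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# map() produces one Counter per document; reduce() folds them into a single total.
word_counts = reduce(merge_counts, map(map_doc, documents), Counter())

print(word_counts.most_common(3))
```

The map and reduce steps have the same shape as in a Hadoop job, but everything runs in memory on one machine, so there is no job scheduling or distributed file I/O to pay for.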
