
Can I improve the performance of my Hadoop MapReduce job by splitting the input data into smaller chunks?

First question: For example, I have a 1 GB input file for the map task, and my default block size is 250 MB, so only 4 mappers will be assigned to the job. If I split the data into 10 pieces of 100 MB each, I get 10 mappers to do the work. But then each piece will occupy one block in storage, which (as I understand it) means 150 MB is wasted in each piece's block. What should I do in this case if I don't want to change the block size of my storage?

Second question: Splitting the input data before the map job can improve its performance. If I want to do the same for the reduce job, should I ask the mapper to split the data before handing it to the reducer, or should I let the reducer do it?

Thank you very much. Please correct me if I've misunderstood something. Hadoop is quite new to me, so any help is appreciated.


2 Answers


If you store 100 MB pieces with a 250 MB block size, the remaining 150 MB per block is not wasted. An HDFS block only occupies as much disk space as the data it actually holds, so that space is still available to the system.

Increasing the number of mappers does not necessarily improve performance, because it also depends on the number of DataNodes you have. For example, 10 DataNodes -> 10 mappers is a good fit. But with 4 DataNodes -> 10 mappers, all the mappers obviously cannot run simultaneously. So if you have 4 DataNodes, it is better to have 4 blocks (with a 250 MB block size).
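If you do want more mappers without physically re-splitting the file or changing the HDFS block size, note that the input split size is configurable per job. Here is a minimal sketch using FileInputFormat from the newer mapreduce API (the 100 MB cap is just the value from the question, and the class name is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split size demo");
            // Cap each input split at 100 MB: a 1 GB input then yields
            // about 10 map tasks, with no file rewriting and no change
            // to the HDFS block size.
            FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... set mapper/reducer classes and submit as usual
        }
    }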

The Reducer is essentially a merge of all your mappers' output, and you can't ask the Mapper to split the data. What you can do instead is ask the Mapper to do a mini-reduce by defining a Combiner. A Combiner is nothing but a reducer that runs on the same node where the mapper executed, before the data is sent to the actual reducer. This minimizes the I/O and, with it, the work of the actual reducer, so introducing a Combiner is a better option to improve performance.
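As a sketch, wiring in a Combiner is a single line in the job driver. This assumes the TokenizerMapper and IntSumReducer classes from the standard WordCount example (not shown here); reusing the reducer as a combiner is only safe when the reduce logic is associative and commutative, as it is for counting:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountWithCombiner {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count with combiner");
            job.setJarByClass(WordCountWithCombiner.class);
            job.setMapperClass(TokenizerMapper.class);   // hypothetical: the WordCount mapper
            // The combiner runs a mini-reduce on each mapper's node before
            // the shuffle, cutting down the data sent over the network.
            // It may run zero or more times per mapper, so the logic must
            // be associative and commutative (sums, counts, max, ...).
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);    // hypothetical: the WordCount reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }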

Good luck with Hadoop!


Multiple mappers can run in parallel on a node for the same job, based on the number of map slots available on that node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How do you feed all the pieces as a single input? Put them all in one directory and add that directory as the input path, as in the sketch below.)
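A minimal sketch of that (the directory name is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class DirectoryInputDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "directory input demo");
            // Add the directory holding all the pieces as the input path;
            // every file inside it becomes at least one input split, so
            // each piece gets its own mapper.
            FileInputFormat.addInputPath(job, new Path("/user/input/pieces")); // hypothetical
            // ... set mapper/reducer classes and submit as usual
        }
    }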

On the reducer side, if you are OK with combining multiple output files in post-processing, you can set a higher number of reducers; the maximum number of reducers running in parallel is the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
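For example (a sketch; 8 is an arbitrary value you would match to your cluster's reduce slots):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "reducer count demo");
            // Request 8 parallel reduce tasks. Each reducer writes its own
            // part-r-NNNNN output file, which you can concatenate in a
            // post-processing step if a single output file is needed.
            job.setNumReduceTasks(8);
            // ... set mapper/reducer classes and submit as usual
        }
    }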

If possible, you can also use a combiner to reduce disk and network I/O overhead.