
I am working on a map-only MapReduce job in Microsoft HDInsight, which is based on Hortonworks. My input data is around 1 GB and the block size is 128 MB.

When I run my job without setting the split size, my input data is separated into 2 splits and the number of map tasks is also 2. The job takes a long time, so I want to speed it up by increasing the number of map tasks.

I control the number of splits by setting the values of mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
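For context, here is a minimal sketch of how these two properties can be set from a driver (the class and job names are hypothetical, not from my actual job). Hadoop's FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), so pinning both bounds below the block size shrinks each split:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfig {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");

            // Hadoop computes the split size as:
            //   max(minSize, min(maxSize, blockSize))
            // To get ~16 splits from a 1 GB input, pin both bounds to
            // 64 MB, which is below the 128 MB block size.
            long splitSize = 64L * 1024 * 1024;
            FileInputFormat.setMinInputSplitSize(job, splitSize); // split.minsize
            FileInputFormat.setMaxInputSplitSize(job, splitSize); // split.maxsize
        }
    }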

First, I set the number of splits to 8, and the job took 35 minutes. Then I set it to 16 and 64, and the times were 21 minutes and 16 minutes respectively.

But when I set the number of splits to 128, the time consumed increased from 16 minutes to 18 minutes.

My questions:

1: Why does the time increase with more map tasks? I know it takes some time to instantiate a map class, but are there any other reasons?

2: Is there a way to decide the most appropriate split size?

Thank you. PS: my input is a text file without the ".txt" extension.


1 Answer

  1. The reason for the increase in time is, as you mentioned, the larger number of map tasks. There is always a tradeoff between the number of mappers and the input split size.

In your case, instantiating a Mapper class in a JVM might be taking more time than the logic inside your Mapper. Another reason might be non-availability of cluster resources to launch all the Mappers at once: some of them will wait until your currently running tasks/Mappers finish, and only then will they be instantiated.

  2. I would suggest just emitting the data through the Mappers by setting the number of reducers to 0, keeping TextInputFormat. The job then writes x output files, one per input split, each roughly the size of a split. See the sketch below.
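A minimal sketch of such a map-only job (the class names and argument paths are hypothetical); with zero reducers, each map task writes its own part file directly, skipping shuffle and sort:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {

        // Pass-through mapper: emits each input line unchanged.
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(NullWritable.get(), value);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only-demo");
            job.setJarByClass(MapOnlyJob.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setNumReduceTasks(0); // no reduce phase: mappers write output directly
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }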