I am working on a map-only MR job in Microsoft HDInsight, which is based on Hortonworks. My input data is around 1 GB and the block size is 128 MB.
When I run the job without setting a split size, my input data is separated into 2 splits and the number of map tasks is also 2. It takes a long time, so I want to speed the job up by increasing the number of map tasks.
I set the number of splits by setting the values of mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
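
For reference, here is roughly how I set these in my driver (a minimal sketch, assuming a standard Hadoop 2.x Job setup; I use the built-in identity Mapper as a stand-in for my real map class):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pin both bounds to the same value so every split comes out at this size.
        long splitSize = 128L * 1024 * 1024; // 128 MB -> ~8 splits over 1 GB
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", splitSize);
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", splitSize);

        Job job = Job.getInstance(conf, "map-only split-size test");
        job.setJarByClass(SplitSizeDriver.class);
        job.setMapperClass(Mapper.class); // identity mapper as a placeholder
        job.setNumReduceTasks(0);         // map-only job
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```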
First, I set the number of splits to 8, and the job took 35 minutes. Then I set it to 16 and then 64, and the times were 21 minutes and 16 minutes respectively.
But when I set the number of splits to 128, the time increased from 16 minutes to 18 minutes.
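
For context on how I picked those values: as I understand it, FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), so to target N map tasks I divide the input size by N and pin both bounds to that value. A rough sketch of that arithmetic (targetSplitSize is my own helper, not a Hadoop API):

```java
// Rough arithmetic for targeting a given number of map tasks.
// FileInputFormat's rule: splitSize = max(minSize, min(maxSize, blockSize)),
// so pinning min == max forces the split size directly.
public class SplitMath {
    // Split size so that inputBytes / size yields roughly targetMaps splits.
    static long targetSplitSize(long inputBytes, int targetMaps) {
        return (inputBytes + targetMaps - 1) / targetMaps; // round up
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        for (int maps : new int[] {8, 16, 64, 128}) {
            System.out.printf("%3d maps -> split size ~%,d bytes%n",
                    maps, targetSplitSize(oneGB, maps));
        }
    }
}
```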
My questions:
1: Why does the time increase with more map tasks? I know it takes some time to instantiate a map class, but are there any other reasons?
2: Is there a way to decide the most appropriate split size?
Thank you. PS: my input file is a text file without a ".txt" extension.