0
votes

I have an iterative MapReduce job in which, when a chunk, say Chunk i, is read by a mapper, some information about the records within that chunk is stored in an auxiliary file, call it F_i. In the next iteration (job), a different mapper might read Chunk i, and that mapper must update some of the information in the auxiliary file F_i. Is there any mechanism for doing this?

I believe that if we can find a way to distinguish between different chunks, the problem is solved. For example, if each chunk had a unique name, then a mapper could simply read the auxiliary file for the chunk it was fed.
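For instance, something along these lines is roughly what I have in mind, assuming a file-based input where each split already carries a unique file name and offset (the /aux directory and the F_ prefix are just placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ChunkAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Path auxPath;   // auxiliary file for the chunk this mapper was fed

    @Override
    protected void setup(Context context) {
        // A file-based split is identified by its source file and byte offset,
        // which gives the chunk a name that is stable across jobs.
        FileSplit split = (FileSplit) context.getInputSplit();
        String chunkId = split.getPath().getName() + "_" + split.getStart();
        auxPath = new Path("/aux/F_" + chunkId);   // placeholder location and prefix
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... this is where the mapper would read/update the file at auxPath ...
        context.write(new Text(auxPath.getName()), value);
    }
}
```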

I don't understand what you are asking. – Robert Harvey
I edited the question; it should be clearer now. – HHH
Seems a little clearer. What would you like from us? – Robert Harvey
I would like to know whether there is any way (mechanism) of doing this. – HHH

1 Answer

0
votes

Use a custom counter. Update the counter in each mapper as you process your splits, starting from 1, so for split #1 the counter is 1. Then name the auxiliary file accordingly, e.g. F_1 for chunk 1.
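A minimal sketch of that first pass, assuming the org.apache.hadoop.mapreduce API, HDFS for the auxiliary files, and the answer's premise that the counter value lines up with the order in which splits are handed out; the counter group/name ("CHUNKS"/"PROCESSED") and the /aux directory are placeholders:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FirstPassMapper extends Mapper<LongWritable, Text, Text, Text> {

    private long chunkNumber;                            // this mapper's chunk number
    private final StringBuilder auxInfo = new StringBuilder();

    @Override
    protected void setup(Context context) {
        // Bump the custom counter and take the value this task sees as the chunk number.
        context.getCounter("CHUNKS", "PROCESSED").increment(1);
        chunkNumber = context.getCounter("CHUNKS", "PROCESSED").getValue();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Collect whatever per-record information later iterations will need.
        auxInfo.append(key.get()).append('\t').append(value.getLength()).append('\n');
        context.write(new Text("chunk_" + chunkNumber), value);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        // Write the auxiliary file F_<chunkNumber> to HDFS.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path auxPath = new Path("/aux/F_" + chunkNumber);   // placeholder directory
        try (FSDataOutputStream out = fs.create(auxPath, true)) {
            out.writeUTF(auxInfo.toString());
        }
    }
}
```

Note that Hadoop counters are accumulated per task and only merged by the framework at job completion, so the value read in setup() is the count this particular task has seen so far.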

Apply the same trick in the next iteration: create a counter and keep increasing it as your mappers proceed. Check the counter's value every time you enter a mapper and read the file whose name carries the same number as the counter's value. For example:

Suppose in the first iteration you processed 5 chunks. This means you end up with 5 files: F_1, F_2, and so on. Now, in the second iteration you will again start from chunk 1. Create the counter and increment it by 1. Then check the counter's value inside the mapper itself; if the value is 1, you know that you have to read the file named F_1.
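Put into code, a sketch of that second-pass mapper, reusing the same placeholder names as above (the CHUNKS/PROCESSED counter and the /aux directory), might look like this:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SecondPassMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String auxInfo = "";   // contents of F_<chunkNumber> from the previous job

    @Override
    protected void setup(Context context) throws IOException {
        // Same trick: bump the counter and use its value as the chunk number.
        context.getCounter("CHUNKS", "PROCESSED").increment(1);
        long chunkNumber = context.getCounter("CHUNKS", "PROCESSED").getValue();

        // Read the auxiliary file written for this chunk by the previous iteration.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path auxPath = new Path("/aux/F_" + chunkNumber);   // placeholder directory
        if (fs.exists(auxPath)) {
            try (FSDataInputStream in = fs.open(auxPath)) {
                auxInfo = in.readUTF();
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // auxInfo is now available to every map() call for this chunk.
        context.write(value, new Text(auxInfo));
    }
}
```

A cleanup() method symmetrical to the first pass could then rewrite F_<chunkNumber> with the updated information for the following iteration.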