I've written some code in Hadoop that should do the following tasks:
In the Mapper: records are read one by one from the input split and some processing is performed on each of them. Then, based on the result of that work, some records are pruned and saved in a set. At the end of the Mapper, this set must be sent to the Reducer.
In the Reducer: all the sets received from all the Mappers are processed and the final result is generated.
My question is: how can I delay sending this set to the Reducer until the last record in each Mapper has been processed? By default, the code written in the Mapper's map() method runs once per input record (correct me if I'm wrong), so the set would be sent to the Reducer multiple times (once per input record). How can I detect that a Mapper has finished processing its input split?
(Currently I use an if-condition with a counter that counts the number of processed records, but I think there must be a better way. Also, if I don't know the total number of records in the files, this method does not work.)
This is the flowchart of the job: (image not shown)
Comment: What about the `cleanup()` method, which is called once after all records have been processed by the `map()` method? – Binary Nerd
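A minimal, framework-free sketch of that suggestion: buffer the pruned records during the per-record `map()` calls and emit them once in `cleanup()`. The class and method names below are illustrative only; in a real job you would extend `org.apache.hadoop.mapreduce.Mapper`, and `cleanup(Context)` would call `context.write()` instead of appending to a list. The pruning rule (`startsWith("keep:")`) is a placeholder for the question's actual processing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simulates the Hadoop Mapper lifecycle: the framework calls map() once
// per record and cleanup() exactly once after the last record of the split.
class PruningMapper {
    private final Set<String> pruned = new HashSet<>();   // records surviving the pruning step
    private final List<String> emitted = new ArrayList<>(); // stands in for context.write()

    // Called once per input record: process it, keep only the wanted ones.
    void map(String record) {
        if (record.startsWith("keep:")) { // placeholder pruning rule
            pruned.add(record);
        }
        // Note: nothing is emitted here, so nothing reaches the Reducer yet.
    }

    // Called exactly once after the final map() call, however many records
    // the split contains - no record counter needed.
    void cleanup() {
        emitted.addAll(pruned); // real code: context.write(key, value) per element
    }

    List<String> getEmitted() {
        return emitted;
    }
}

public class Main {
    public static void main(String[] args) {
        PruningMapper m = new PruningMapper();
        for (String rec : new String[] {"keep:a", "drop:b", "keep:c"}) {
            m.map(rec); // in Hadoop, the framework drives this loop
        }
        m.cleanup(); // framework calls this once at the end of the split
        System.out.println(m.getEmitted().size()); // prints 2
    }
}
```

Because the framework guarantees `cleanup()` runs after the last `map()` call, there is no need to know the total record count in advance, which removes the counter-based workaround entirely.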