I run a map-only job on Hadoop to sort key/value pairs, because I read that "Hadoop automatically sorts data emitted by mappers before being sent to reducers".
input file
2013-04-15 835352
2013-04-16 846299
2013-04-17 828286
2013-04-18 747767
2013-04-19 807924
I think Map(second_column, first_column) should sort this file as shown in output1. It did when I ran it on my local machine, but when I run it on a cluster, the output looks like output2.
output1 file
747767 2013-04-18
807924 2013-04-19
828286 2013-04-17
835352 2013-04-15
846299 2013-04-16
output2 file
835352 2013-04-15
747767 2013-04-18
807924 2013-04-19
828286 2013-04-17
846299 2013-04-16
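To make the ordering I expect concrete, here is a minimal plain-Java sketch that involves no Hadoop at all: it swaps the two columns and sorts by the numeric value, which is what output1 shows. The class name `ExpectedOrderSketch` and the method `swapAndSort` are hypothetical names used only for illustration; they are not part of my job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the Hadoop job: it only illustrates
// the ordering expected in output1.
class ExpectedOrderSketch {

    // Swaps the two whitespace-separated columns of each line and sorts
    // the rows by the integer that was in the second column.
    static List<String> swapAndSort(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("#")) {
                continue; // skip comment lines, as the mapper does
            }
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 2) {
                rows.add(cols);
            }
        }
        // Numeric comparison, matching the IntWritable key used by the mapper
        rows.sort(Comparator.comparingInt((String[] cols) -> Integer.parseInt(cols[1])));

        List<String> result = new ArrayList<>();
        for (String[] cols : rows) {
            result.add(cols[1] + " " + cols[0]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "2013-04-15 835352",
                "2013-04-16 846299",
                "2013-04-17 828286",
                "2013-04-18 747767",
                "2013-04-19 807924");
        for (String row : swapAndSort(input)) {
            System.out.println(row);
        }
    }
}
```

Running `main` on the input file above prints the five lines in the same order as output1.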
How can I guarantee that the output always looks like output1? I am open to other suggestions for sorting by the second column.
Mapper
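import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapAccessTime1 extends Mapper<LongWritable, Text, IntWritable, Text> {
    // Reused output objects: key = second column (number), value = first column (date)
    private IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        int val = 0;
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Skip comment lines
        if (!line.startsWith("#")) {
            // First token: the date, emitted as the value
            if (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
            }
            // Second token: the number, emitted as the key
            if (tokenizer.hasMoreTokens()) {
                val = Integer.parseInt(tokenizer.nextToken());
                one.set(val);
                context.write(one, word);
            }
        }
    }
}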