
My question is this. The Apache Hadoop documentation gives the following example command for Hadoop Streaming:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Now I feed a text file to this streaming job. Say the text file contains just the following two lines:

This is line1
It becomes line2

The hadoop streaming command works perfectly and there is no problem.
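Since streaming just pipes text through the given commands, the job's data flow can be approximated locally in plain shell (a rough sketch only: the real job also splits the input and partitions/groups the mapper output by key):

```shell
# Local approximation of the streaming job above:
# mapper = /bin/cat, shuffle ~ sort, reducer = /bin/wc
printf 'This is line1\nIt becomes line2\n' > input.txt
/bin/cat < input.txt | sort | /bin/wc
# counts 2 lines, 6 words, 31 bytes
rm input.txt
```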

But despite reading the linked material and other examples on the Internet many times, I am unable to answer the following questions. Assume there is only one mapper and only one reducer:

  1. As I understand it, the mapper gets a (key, value) pair as input. In the case of the above two lines, what would be the key and what would be the value?
  2. The mapper command is 'cat'. Will 'cat' act on the key part of the input or the value part?
  3. What would be the output of the mapper if the input is just the above two lines? What would be the 'key' part and what would be the 'value' part?
  4. The reducer will get these (key, value) pairs. The reducer command is 'wc'. How would 'wc' know whether to act on the 'key' or the 'value' of this tuple?

I understand these are very basic questions, but I keep getting stuck trying to find a proper answer. I would be grateful for help.

Thanks.


1 Answer


In the case of the above two lines what would be the key and what would be the value.

With the default TextInputFormat, the key is the byte offset of the line within the file, and the value is the entire line of text.
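For the two example lines, those (key, value) pairs can be reproduced with a quick awk sketch (offsets assume each line ends with a single newline byte):

```shell
# Emit (byte offset, line) pairs the way TextInputFormat would:
# each line's offset advances by the line length plus one byte
# for the trailing newline.
printf 'This is line1\nIt becomes line2\n' |
  awk 'BEGIN { off = 0 } { print off "\t" $0; off += length($0) + 1 }'
# 0     This is line1
# 14    It becomes line2
```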

In the Java API, a mapper receives both the key and the value. In streaming with text input, however, only the value (the line itself) is written to the mapper command's stdin; the byte-offset key is dropped. So 'cat' acts on the value.

The output of 'cat' is the two lines unchanged. Streaming then splits each mapper output line at the first tab character: everything before the tab is the key, and the rest is the value. Since 'cat' emits no tabs here, each entire line becomes the key and the value is empty.
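The tab-splitting rule can be sketched in plain shell (the line below is a hypothetical mapper output containing a tab, not part of the question's data):

```shell
# Streaming treats the text before the first tab as the key
# and everything after it as the value.
tab="$(printf '\t')"
line="somekey${tab}some value text"
key=${line%%"$tab"*}      # strip from the first tab onward
value=${line#*"$tab"}     # strip through the first tab
echo "key=$key value=$value"
# key=somekey value=some value text
```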

'wc' does not know about keys or values at all: it simply counts the lines, words, and bytes of everything arriving on its stdin. With a single reducer, its stdin is the entire (sorted) mapper output, so each whole (key, value) line is counted, and the job produces a single set of counts.