0 votes

I ran the Hadoop MapReduce WordCount job on 1 MB of data. I have some questions about the job output below:

  • What is a counter?

  • Why are there two map tasks? As I understand it, the number of map tasks is decided by the number of input splits, and the minimum input split size is 64 MB, so logically there should be only one map task.

  • What is the size of the output data from the reducers?

  • For CPU time spent: which CPU does this refer to, given that each TaskTracker has its own CPU and memory?

Thanks a lot!

[user1@li417-43 ~]$ hadoop jar wordcount1.jar wordcount1.WordCount -D mapred.reduce.tasks=10 wordin wordout10-1m
    14/12/16 19:55:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    14/12/16 19:55:46 INFO mapred.FileInputFormat: Total input paths to process : 1
    14/12/16 19:55:46 INFO mapred.JobClient: Running job: job_201405031326_0032
    14/12/16 19:55:47 INFO mapred.JobClient:  map 0% reduce 0%
    14/12/16 19:55:59 INFO mapred.JobClient:  map 100% reduce 0%
    14/12/16 19:56:04 INFO mapred.JobClient:  map 100% reduce 40%
    14/12/16 19:56:09 INFO mapred.JobClient:  map 100% reduce 80%
    14/12/16 19:56:14 INFO mapred.JobClient:  map 100% reduce 100%
    14/12/16 19:56:15 INFO mapred.JobClient: Job complete: job_201405031326_0032
    14/12/16 19:56:15 INFO mapred.JobClient: Counters: 34
    14/12/16 19:56:15 INFO mapred.JobClient:   File System Counters
    14/12/16 19:56:15 INFO mapred.JobClient:     FILE: Number of bytes read=2008100
    14/12/16 19:56:15 INFO mapred.JobClient:     FILE: Number of bytes written=5988058
    14/12/16 19:56:15 INFO mapred.JobClient:     FILE: Number of read operations=0
    14/12/16 19:56:15 INFO mapred.JobClient:     FILE: Number of large read operations=0
    14/12/16 19:56:15 INFO mapred.JobClient:     FILE: Number of write operations=0
    14/12/16 19:56:15 INFO mapred.JobClient:     HDFS: Number of bytes read=1005254
    14/12/16 19:56:15 INFO mapred.JobClient:     HDFS: Number of bytes written=140119
    14/12/16 19:56:15 INFO mapred.JobClient:     HDFS: Number of read operations=14
    14/12/16 19:56:15 INFO mapred.JobClient:     HDFS: Number of large read operations=0
    14/12/16 19:56:15 INFO mapred.JobClient:     HDFS: Number of write operations=20
    14/12/16 19:56:15 INFO mapred.JobClient:   Job Counters
    14/12/16 19:56:15 INFO mapred.JobClient:     Launched map tasks=2
    14/12/16 19:56:15 INFO mapred.JobClient:     Launched reduce tasks=10
    14/12/16 19:56:15 INFO mapred.JobClient:     Data-local map tasks=1
    14/12/16 19:56:15 INFO mapred.JobClient:     Rack-local map tasks=1
    14/12/16 19:56:15 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=12953
    14/12/16 19:56:15 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=49609
    14/12/16 19:56:15 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    14/12/16 19:56:15 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    14/12/16 19:56:15 INFO mapred.JobClient:   Map-Reduce Framework
    14/12/16 19:56:15 INFO mapred.JobClient:     Map input records=35293
    14/12/16 19:56:15 INFO mapred.JobClient:     Map output records=181014
    14/12/16 19:56:15 INFO mapred.JobClient:     Map output bytes=1646012
    14/12/16 19:56:15 INFO mapred.JobClient:     Input split bytes=206
    14/12/16 19:56:15 INFO mapred.JobClient:     Combine input records=0
    14/12/16 19:56:15 INFO mapred.JobClient:     Combine output records=0
    14/12/16 19:56:15 INFO mapred.JobClient:     Reduce input groups=14276
    14/12/16 19:56:15 INFO mapred.JobClient:     Reduce shuffle bytes=2008160
    14/12/16 19:56:15 INFO mapred.JobClient:     Reduce input records=181014
    14/12/16 19:56:15 INFO mapred.JobClient:     Reduce output records=14276
    14/12/16 19:56:15 INFO mapred.JobClient:     Spilled Records=362028
    14/12/16 19:56:15 INFO mapred.JobClient:     CPU time spent (ms)=26020
    14/12/16 19:56:15 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1427562496
    14/12/16 19:56:15 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=8291246080
    14/12/16 19:56:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=477896704
    14/12/16 19:56:15 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
    14/12/16 19:56:15 INFO mapred.JobClient:     BYTES_READ=1002479

1 Answer

1 vote
  1. Counters: 34 is the number of counters reported for this job. Each line below the Counters: 34 header is one counter, i.e. a named metric that Hadoop collects per task and aggregates over the whole job, grouped into File System Counters, Job Counters, and so on.
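
You can also read a single counter back from the command line while the JobTracker still knows about the job. For example, reusing the job id, counter group, and counter name that appear verbatim in the log above, this should print 1002479:

    hadoop job -counter job_201405031326_0032 \
        org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter BYTES_READ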

  2. I think this is due to speculative execution (search for "speculative" on https://developer.yahoo.com/hadoop/tutorial/module4.html). Hadoop launches the same mapper twice to see which one finishes first, and then kills the slower one. You can disable it by setting the mapred.map.tasks.speculative.execution configuration property to false in the mapred-site.xml file.

One mapper was launched on a node that holds the data; the other ran on a different node in the same rack, which matches the counters Data-local map tasks=1 and Rack-local map tasks=1.
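
If you only want to turn speculative execution off for a single run, you can also pass the same property with -D, just as the command in the question already does for mapred.reduce.tasks (the output directory name below is only an example):

    hadoop jar wordcount1.jar wordcount1.WordCount \
        -D mapred.map.tasks.speculative.execution=false \
        -D mapred.reduce.tasks=10 wordin wordout-nospec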

  3. Your reducers produced 14276 output lines in total (Reduce output records=14276). The size in bytes shows up under HDFS: Number of bytes written=140119, i.e. roughly 140 KB, essentially the ten part files written by your ten reducers.
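
You can confirm the output size directly on HDFS; wordout10-1m is the output directory from the command in the question:

    hadoop fs -du wordout10-1m      # size of each part file, one per reducer
    hadoop fs -dus wordout10-1m     # total size of the output directory (Hadoop 1.x syntax)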

  4. CPU time spent (ms) is not tied to a single CPU: it is the sum of the CPU time consumed by every task of the job, on whichever node each task happened to run. It is mostly useful for comparing the cost of different jobs or configurations.