I have hadoop 2.5.2 single mode installation on my Ubuntu VM, which is: 4-core, 3GHz per core; 4G memory. This VM is not for production, only for demo and learning.
Then, I wrote a vey simple map-reduce application using python, and use this application to process 49 xmls. All these xml files are small-size, hundreds of lines each. So, I expected a quick process. But, big22 surprise to me, it took more than 20 minutes to finish the job (the output of the job is correct.). Below is the output metrics :
14/12/15 19:37:55 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 19:37:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 19:38:03 INFO mapred.FileInputFormat: Total input paths to process : 49
14/12/15 19:38:06 INFO mapreduce.JobSubmitter: number of splits:49
14/12/15 19:38:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1418368500264_0005
14/12/15 19:38:10 INFO impl.YarnClientImpl: Submitted application application_1418368500264_0005
14/12/15 19:38:10 INFO mapreduce.Job: Running job: job_1418368500264_0005
14/12/15 19:38:59 INFO mapreduce.Job: Job job_1418368500264_0005 running in uber mode : false
14/12/15 19:38:59 INFO mapreduce.Job: map 0% reduce 0%
14/12/15 19:39:42 INFO mapreduce.Job: map 2% reduce 0%
14/12/15 19:40:05 INFO mapreduce.Job: map 4% reduce 0%
14/12/15 19:40:28 INFO mapreduce.Job: map 6% reduce 0%
14/12/15 19:40:49 INFO mapreduce.Job: map 8% reduce 0%
14/12/15 19:41:10 INFO mapreduce.Job: map 10% reduce 0%
14/12/15 19:41:29 INFO mapreduce.Job: map 12% reduce 0%
14/12/15 19:41:50 INFO mapreduce.Job: map 14% reduce 0%
14/12/15 19:42:08 INFO mapreduce.Job: map 16% reduce 0%
14/12/15 19:42:28 INFO mapreduce.Job: map 18% reduce 0%
14/12/15 19:42:49 INFO mapreduce.Job: map 20% reduce 0%
14/12/15 19:43:08 INFO mapreduce.Job: map 22% reduce 0%
14/12/15 19:43:28 INFO mapreduce.Job: map 24% reduce 0%
14/12/15 19:43:48 INFO mapreduce.Job: map 27% reduce 0%
14/12/15 19:44:09 INFO mapreduce.Job: map 29% reduce 0%
14/12/15 19:44:29 INFO mapreduce.Job: map 31% reduce 0%
14/12/15 19:44:49 INFO mapreduce.Job: map 33% reduce 0%
14/12/15 19:45:09 INFO mapreduce.Job: map 35% reduce 0%
14/12/15 19:45:28 INFO mapreduce.Job: map 37% reduce 0%
14/12/15 19:45:49 INFO mapreduce.Job: map 39% reduce 0%
14/12/15 19:46:09 INFO mapreduce.Job: map 41% reduce 0%
14/12/15 19:46:29 INFO mapreduce.Job: map 43% reduce 0%
14/12/15 19:46:49 INFO mapreduce.Job: map 45% reduce 0%
14/12/15 19:47:09 INFO mapreduce.Job: map 47% reduce 0%
14/12/15 19:47:29 INFO mapreduce.Job: map 49% reduce 0%
14/12/15 19:47:49 INFO mapreduce.Job: map 51% reduce 0%
14/12/15 19:48:08 INFO mapreduce.Job: map 53% reduce 0%
14/12/15 19:48:28 INFO mapreduce.Job: map 55% reduce 0%
14/12/15 19:48:48 INFO mapreduce.Job: map 57% reduce 0%
14/12/15 19:49:09 INFO mapreduce.Job: map 59% reduce 0%
14/12/15 19:49:29 INFO mapreduce.Job: map 61% reduce 0%
14/12/15 19:49:55 INFO mapreduce.Job: map 63% reduce 0%
14/12/15 19:50:23 INFO mapreduce.Job: map 65% reduce 0%
14/12/15 19:50:53 INFO mapreduce.Job: map 67% reduce 0%
14/12/15 19:51:22 INFO mapreduce.Job: map 69% reduce 0%
14/12/15 19:51:50 INFO mapreduce.Job: map 71% reduce 0%
14/12/15 19:52:18 INFO mapreduce.Job: map 73% reduce 0%
14/12/15 19:52:48 INFO mapreduce.Job: map 76% reduce 0%
14/12/15 19:53:18 INFO mapreduce.Job: map 78% reduce 0%
14/12/15 19:53:48 INFO mapreduce.Job: map 80% reduce 0%
14/12/15 19:54:18 INFO mapreduce.Job: map 82% reduce 0%
14/12/15 19:54:48 INFO mapreduce.Job: map 84% reduce 0%
14/12/15 19:55:19 INFO mapreduce.Job: map 86% reduce 0%
14/12/15 19:55:48 INFO mapreduce.Job: map 88% reduce 0%
14/12/15 19:56:16 INFO mapreduce.Job: map 90% reduce 0%
14/12/15 19:56:44 INFO mapreduce.Job: map 92% reduce 0%
14/12/15 19:57:14 INFO mapreduce.Job: map 94% reduce 0%
14/12/15 19:57:45 INFO mapreduce.Job: map 96% reduce 0%
14/12/15 19:58:15 INFO mapreduce.Job: map 98% reduce 0%
14/12/15 19:58:46 INFO mapreduce.Job: map 100% reduce 0%
14/12/15 19:59:20 INFO mapreduce.Job: map 100% reduce 100%
14/12/15 19:59:28 INFO mapreduce.Job: Job job_1418368500264_0005 completed successfully
14/12/15 19:59:30 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=17856
FILE: Number of bytes written=5086434
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=499030
HDFS: Number of bytes written=10049
HDFS: Number of read operations=150
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=49
Launched reduce tasks=1
Data-local map tasks=49
Total time spent by all maps in occupied slots (ms)=8854232
Total time spent by all reduces in occupied slots (ms)=284672
Total time spent by all map tasks (ms)=1106779
Total time spent by all reduce tasks (ms)=35584
Total vcore-seconds taken by all map tasks=1106779
Total vcore-seconds taken by all reduce tasks=35584
Total megabyte-seconds taken by all map tasks=1133341696
Total megabyte-seconds taken by all reduce tasks=36438016
Map-Reduce Framework
Map input records=9352
Map output records=296
Map output bytes=17258
Map output materialized bytes=18144
Input split bytes=6772
Combine input records=0
Combine output records=0
Reduce input groups=53
Reduce shuffle bytes=18144
Reduce input records=296
Reduce output records=52
Spilled Records=592
Shuffled Maps =49
Failed Shuffles=0
Merged Map outputs=49
GC time elapsed (ms)=33590
CPU time spent (ms)=191390
Physical memory (bytes) snapshot=13738057728
Virtual memory (bytes) snapshot=66425016320
Total committed heap usage (bytes)=10799808512
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=492258
File Output Format Counters
Bytes Written=10049
14/12/15 19:59:30 INFO streaming.StreamJob: Output directory: /data_output/sb50projs_1_output
As a newbie to hadoop, for this crazy unreasonable performance, I have several questions:
- how to configure my hadoop/yarn/mapreduce to make the whole environment more convenient for trial usage?
I understand hadoop is designed for huge-data and big files. But for a trial environment, my files are small and my data is very limited, which default configuration items should I change? I have changed "dfs.blocksize" of hdfs-site.xml to a smaller value to match my small files, but seems no big enhancements. I know there are some JVM configuration items in yarn-site.xml and mapred-site.xml, but I am not sure about how to adjust them.
- how to read hadoop logs
Under the logs folder, there are separate log files for nodemanager/resourcemanager/namenode/datanode. I tried to read these files to understand how the 20 minutes are spent during the process, but it's not easy for a newbie like me. So I wonder is there any tool/UI could help me to analyze the logs.
- basic performance tuning tools
Actually I have googled around for this question, and I got a bunch of names like Ganglia/Nagios/Vaidya/Ambari. I want to know, which tool is best analyse the issue like , "why it took 20 minutes to do such a simple job?".
- big number of hadoop processes
Even if there is no job running on my hadoop, I found around 100 hadoop processes on my VM, like below (I am using htop, and sort the result by memory). Is this normal for hadoop ? Or am I incorrect for some environment configuration?