Python MapReduce example for max/min temperature in Hadoop

Question

I have set up Hadoop on my Ubuntu machine and run the example code to test it. One of the common examples is https://github.com/tomwhite/hadoop-book/tree/master/ch02/src/main/python

I have tested this code with the given sample file (https://github.com/tomwhite/hadoop-book/blob/master/input/ncdc/sample.txt). However, when I modified the mapper code according to my data file, the reducer goes from 0% to 33% and then back to 0%. Can anyone explain why that happens, or how I should modify the code? My data looks like this:

STN---,WBAN , YEARMODA,   TEMP,  ,   DEWP,  ,  SLP  ,  ,  STP  ,  , VISIB,  ,  WDSP,  , MXSPD,  GUST,   MAX  ,  MIN  ,PRCP  ,SNDP , FRSHTT,


690190,13910, 20120101,   42.9,18,   29.4,18, 1033.3,18,  968.7,18,  10.0,18,   8.7,18,  15.0, 999.9,   52.5*,  31.6*, 0.00I,999.9, 000000,

its like /user/hadoop/../_logs ---> /_logs/history there are two files, a .jar and conf.xml. — farey

Chris White · Accepted Answer · 2013-06-28T01:27:04

If you check the job tracker, I'm sure you'll find that the map task is failing and being rescheduled to run on another node (eventually the job fails). This is probably due to the Python script throwing an error, so I would recommend (if you haven't already done this) piping your sample data through your mapper to see what it yields.

For example, I took your data and ran it through the linked Python mapper (with an additional print statement to see the extracted columns):

#> cat data.csv | python map.py
EARM  MXSP D


0120   15. 0
0120      15.
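The odd values in that output are what you get when fixed-width slicing is applied to comma-separated data. As far as I recall, the linked mapper takes the year from `val[15:19]`, which suits the fixed-width NCDC sample but not your file. The sketch below only illustrates this on shortened copies of your header and data row; the helper names `fixed_width_year` and `csv_year` are made up for the illustration, and the assumption that YEARMODA is the third comma-separated field comes from your header line.

```python
# Shortened copies of the header and data row from the question.
header = "STN---,WBAN , YEARMODA,   TEMP,"
row = "690190,13910, 20120101,   42.9,18,"


def fixed_width_year(line):
    # Same slice the linked mapper is believed to use for the year.
    return line.strip()[15:19]


def csv_year(line):
    # Assumption: YEARMODA is the third comma-separated field (index 2).
    return line.split(",")[2].strip()[:4]


print(repr(fixed_width_year(header)))  # 'EARM'
print(repr(fixed_width_year(row)))     # '0120'
print(repr(csv_year(row)))             # '2012'
```

So the slice lands one character too far to the right on your rows, and splitting on commas is the more natural fit for this layout.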

Obviously your mapper has been amended, as you note in your question, so you need to make sure the Python script processes your sample data without error. If it does run without error, then you need to check the logs for the failed map tasks (and post them into your question).
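For reference, here is a minimal sketch of what a mapper for your comma-separated layout might look like. It is not your amended script (which isn't shown), and it rests on a few assumptions taken only from the header and row in your question: YEARMODA is field 2, MAX is field 17, a trailing `*` is a flag to strip, and `9999.9` marks a missing value. The names `parse_line`, `DATE_COL`, `MAX_COL` and `MISSING` are hypothetical. The point is that header lines and malformed rows are skipped rather than raising an exception, since an uncaught error is the kind of thing that makes a streaming map task fail.

```python
import sys

DATE_COL = 2      # assumed position of YEARMODA in the comma-separated row
MAX_COL = 17      # assumed position of MAX in the comma-separated row
MISSING = 9999.9  # assumed sentinel for a missing temperature


def parse_line(line):
    """Return (year, max_temp), or None if the line should be skipped."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) <= MAX_COL:
        return None  # blank or truncated line
    date = fields[DATE_COL]
    if len(date) != 8 or not date.isdigit():
        return None  # header line or malformed date
    try:
        # Values such as "52.5*" carry a trailing flag character.
        temp = float(fields[MAX_COL].rstrip("*"))
    except ValueError:
        return None
    if temp >= MISSING:
        return None
    return date[:4], temp


def main(stream=sys.stdin):
    # Emit tab-separated key/value pairs, as Hadoop streaming expects.
    for line in stream:
        record = parse_line(line)
        if record is not None:
            print("%s\t%s" % record)


if __name__ == "__main__":
    main()
```

You can check it locally the same way as above, by piping your data file through the script before submitting the job. Note that it emits decimal values; if I remember the linked reducer correctly, it converts values with `int()`, so it would probably need `float()` instead.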
