
I am trying to write a mapper/reducer program to calculate the max/min temperature from a data set. I tried to modify the code myself, but it doesn't work. The mapper runs fine, but the reducer has not worked since I changed the mapper.

My sample code: mapper.py

import re
import sys

for line in sys.stdin:
  val = line.strip()
  (year, temp, q) = (val[14:18], val[25:30], val[31:32])
  if (temp != "9999" and re.match("[01459]", q)):
    print "%s\t%s" % (year, temp)

reducer.py

import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
  (key, val) = line.strip().split("\t")
  if last_key and last_key != key:
    print "%s\t%s" % (last_key, max_val)
    (last_key, max_val) = (key, int(val))
  else:
    (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
  print "%s\t%s" % (last_key, max_val)

sample line from file:

690190,13910, 2012**0101, *42.9,18, 29.4,18, 1033.3,18, 968.7,18, 10.0,18, 8.7,18, 15.0, 999.9, 52.5, 31.6*, 0.00I,999.9, 000000,

I need the values in bold. Any ideas?

This is my output if I run the mapper as a standalone script:

root@ubuntu:/home/hduser/files# python maxtemp-map.py
2012    42.9
2012    50.0
2012    47.0
2012    52.0
2012    43.4
2012    52.6
2012    51.1
2012    50.9
2012    57.8
2012    50.7
2012    44.6
2012    46.7
2012    52.1
2012    48.4
2012    47.1
2012    51.8
2012    50.6
2012    53.4
2012    62.9
2012    62.6

The file contains data for several years. I have to calculate the min, max, and average for each year.

FIELD   POSITION  TYPE   DESCRIPTION

STN---  1-6       Int.   Station number (WMO/DATSAV3 number)
                         for the location.

WBAN    8-12      Int.   WBAN number where applicable--this is the
                         historical

YEAR    15-18     Int.   The year.

MODA    19-22     Int.   The month and day.

TEMP    25-30     Real   Mean temperature. Missing = 9999.9


Count   32-33     Int.   Number of observations in mean temperature
I need the year (2012) and the temperature (42.9) from each line. :/ - farey
What do the * represent? Do all numbers of each line represent the same quantity (temperature)? - wflynny
The stars were added because I tried to make those values bold; please ignore them. Each line has a number of parameters, such as stddev, year, month, date, temp, avg temp, max temp, etc. I have shown one line to give an idea of the pattern. - farey

2 Answers

0
votes

I am having trouble parsing your question, but I think it reduces to this:

You have a dataset and each line of the dataset represents different quantities related to a single time point. You would like to extract the max/min of one of these quantities from the entire dataset.

If this is the case, I'd do something like this:

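temps = []
with open(file_name, 'r') as infile:  # file_name: path to your data file
    for line in infile:
        fields = line.strip().split(',')
        year = int(fields[2].strip()[:4])  # " 20120101" -> 2012
        temp = float(fields[3])            # temperatures are reals, e.g. 42.9
        temps.append((temp, year))

# Sorting (temp, year) tuples orders them by temperature first
temps = sorted(temps)
min_temp, min_year = temps[0]
max_temp, max_year = temps[-1]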
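
The year and temperature conversions can be checked against the sample line from the question (with the bold-marker stars removed). This is only a minimal sketch, assuming the file really is comma-separated as the sample suggests. `parse_line` and `MISSING_TEMP` are hypothetical names, and the 9999.9 sentinel is taken from the field description in the question.

```python
# Sample line from the question, with the bold-marker stars removed.
SAMPLE = ("690190,13910, 20120101, 42.9,18, 29.4,18, 1033.3,18, 968.7,18, "
          "10.0,18, 8.7,18, 15.0, 999.9, 52.5, 31.6, 0.00I,999.9, 000000,")

# Assumption: "Missing = 9999.9" as stated in the question's field description.
MISSING_TEMP = 9999.9


def parse_line(line):
    """Return (year, temp) for one record, or None if it is unusable.

    parse_line is a hypothetical helper name, not part of any library.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 4:
        return None
    try:
        year = int(fields[2][:4])  # "20120101" -> 2012
        temp = float(fields[3])    # "42.9" -> 42.9 (int() would fail here)
    except ValueError:
        return None
    if temp == MISSING_TEMP:
        return None
    return year, temp


print(parse_line(SAMPLE))
```

Stripping each field before slicing matters because the sample has a space after every comma, so the raw third field starts with a blank.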

EDIT:

Farey, I think what you are doing with mapper/reducer may be overkill for what you want from your data. Here are some additional questions about your initial file structure.

  1. What are the contents of each line (be specific) in the dataset? For example: date, time, temp, pressure, ....
  2. Which piece of data from each line do you want to extract? Temperature? At what position in the line is that piece of data?
  3. Does each file only contain data from one year?

For example, if your dataset looked like

year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...

then the simplest thing to do is to loop through each line and extract the relevant information. It appears you only want the year and the temperature. In this example, these are located at positions 0 and 3 in each line. Therefore, we will have a loop that looks like

from collections import defaultdict
data = defaultdict(list)

with open(file_name, 'r') as infile:  # file_name: path to your data file
    for line in infile:
        fields = line.strip().split(', ')
        year = fields[0]
        temp = float(fields[3])  # convert to a number so we can do statistics
        data[year].append(temp)

Here we extract the year and temp from each line in the file and store them in a special dictionary object (a defaultdict). If we printed it out, it would look like

year1: [temp1, temp2, temp3, temp4]
year2: [temp5, temp6, temp7, temp8]
year3: [temp9, temp10, temp11, temp12]
year4: [temp13, temp14, temp15, temp16]

Now, this makes it very convenient for us to do statistics on all the temperatures of a given year. For example, to compute the maximum, minimum, and average temperature, we could do

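import numpy as np
for year in data:
    temps = np.array(data[year], dtype=float)  # also converts numeric strings
    output = (year, temps.mean(), temps.min(), temps.max())
    # Unpack the tuple so each value fills its own placeholder
    print('Year: {0} Avg: {1} Min: {2} Max: {3}'.format(*output))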
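
If numpy is not available, the same per-year statistics can be sketched with built-ins only. The `pairs` list below is illustrative input: the 2012 values are copied from the mapper output in the question, while the 2011 values are made up. `summarize` is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative (year, temp) pairs. The 2012 values come from the mapper output
# in the question; the 2011 values are invented for this example.
pairs = [("2012", "42.9"), ("2012", "50.0"), ("2012", "47.0"),
         ("2011", "40.0"), ("2011", "60.0")]

data = defaultdict(list)
for year, temp in pairs:
    data[year].append(float(temp))  # convert before doing arithmetic


def summarize(temps):
    """Return (min, max, average) of a non-empty list of floats."""
    return min(temps), max(temps), sum(temps) / len(temps)


for year in sorted(data):
    lo, hi, avg = summarize(data[year])
    print("Year: {0} Min: {1} Max: {2} Avg: {3:.2f}".format(year, lo, hi, avg))
```

Keeping every temperature in a list is fine for a single file; for very large inputs a running min, max, sum, and count per year would use less memory.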

I'm more than willing to help you sort out your problem, but I need you to be more specific about what exactly your data looks like, and what you want to extract.

0
votes

If the mapper emits something like the store name and the sales from that store as its intermediate result, you can use the following reducer to find the maximum total sales and the store that has them. Similarly, it finds the minimum total sales and the store that has them.

The following reducer code example assumes that you have the sales total against each store as an input file.

#! /usr/bin/python

import sys

mydict = {}

salesTotal = 0
oldKey = None

for line in sys.stdin:
    data=line.strip().split("\t")

    if len(data)!=2:
        continue

    thisKey, thisSale = data

    if oldKey and oldKey != thisKey:
        mydict[oldKey] = float(salesTotal)
        salesTotal = 0

    oldKey = thisKey
    salesTotal += float(thisSale)

if oldKey is not None:
    mydict[oldKey] = float(salesTotal)

# max()/min() raise ValueError on an empty dict, so only report if there was input
if mydict:
    maximum = max(mydict, key=mydict.get)
    print(maximum, mydict[maximum])

    minimum = min(mydict, key=mydict.get)
    print(minimum, mydict[minimum])
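
The same key-change pattern could be adapted to the temperature question. The following is a rough sketch, not a tested Hadoop job: it assumes the mapper emits `year<TAB>temp` lines and that the input arrives sorted by year, as Hadoop streaming normally delivers it to a reducer. `reduce_lines` is a hypothetical name, and the sample input is made up apart from the 42.9 and 50.0 values shown in the question.

```python
def reduce_lines(lines):
    """Yield 'year<TAB>min<TAB>max<TAB>avg' for each run of identical keys.

    Assumes the lines are sorted by key (the year).
    """
    last_key = None
    lo = hi = total = 0.0
    count = 0
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        key, val = parts
        try:
            temp = float(val)  # float, not int: values look like 42.9
        except ValueError:
            continue
        if key != last_key:
            if last_key is not None:
                # Key changed: emit the finished year before starting the next
                yield "%s\t%s\t%s\t%.2f" % (last_key, lo, hi, total / count)
            last_key, lo, hi, total, count = key, temp, temp, 0.0, 0
        lo = min(lo, temp)
        hi = max(hi, temp)
        total += temp
        count += 1
    if last_key is not None:
        yield "%s\t%s\t%s\t%.2f" % (last_key, lo, hi, total / count)


# Illustrative input only; in a streaming job the lines would come from sys.stdin.
sample = ["2011\t40.0\n", "2011\t60.0\n", "2012\t42.9\n", "2012\t50.0\n"]
for out in reduce_lines(sample):
    print(out)
```

To run it as an actual streaming reducer, replace `sample` with `sys.stdin` (after `import sys`). Parsing with `float` rather than `int` is the key difference from the reducer in the question, since `int("42.9")` raises a `ValueError`.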