I have a situation where I need to process a very large text file with the following format:
ID \t time \t duration \t Description \t status
I want to use MapReduce to process this file. I understand that MapReduce works on key/value pairs: the mapper outputs a key and a value, and MapReduce ensures that all values with the same key end up in the same reducer.
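To make the grouping behaviour concrete, here is a minimal pure-Python sketch that imitates the step between map and reduce. It is only an illustration: the `shuffle` function and the sample pairs are made up for this example and are not part of Hadoop.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, imitating the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Hypothetical mapper output: three (key, value) pairs, two of them sharing a key.
mapper_output = [("a", "row1"), ("b", "row2"), ("a", "row3")]
grouped = shuffle(mapper_output)

# Values that share a key end up together, so one reduce call sees all of them:
# grouped == {"a": ["row1", "row3"], "b": ["row2"]}
for key in sorted(grouped):
    print(key, grouped[key])
```

Note that a single reducer process may handle many keys; the guarantee is only that one key is never split across reducers.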
What I want to end up in a single reducer is the set of rows whose times are within 1 hour of each other. In the reducer, I would also like access to the other fields (ID, duration, status) so I can do further processing. So I guess the value to output is a list or something similar?
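A reducer along these lines could be sketched as follows, assuming Hadoop Streaming is used and the mapper emits each line as `key<TAB>original row`, so the value is simply the whole row rather than a list. Streaming hands the reducer its input sorted by key, one line per value, so consecutive lines with the same key can be grouped with `itertools.groupby`. The function names `parse` and `reduce_stream` are hypothetical.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(line):
    """Split 'key<TAB>row' on the first tab only; the row keeps its own tabs."""
    key, row = line.rstrip("\n").split("\t", 1)
    return key, row.split("\t")


def reduce_stream(lines):
    """Yield (key, rows) for each run of consecutive lines that share a key."""
    parsed = (parse(line) for line in lines if line.strip())
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, [fields for _, fields in group]


if __name__ == "__main__":
    for key, rows in reduce_stream(sys.stdin):
        # Assumes every row has exactly the five fields from the file format.
        for row_id, time, duration, description, status in rows:
            # Placeholder: the real per-group processing would go here.
            print("\t".join([key, row_id, status]))
```

Because the full row travels as the value, every field (ID, duration, status and so on) is available again inside the reducer.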
I have some Python code to process the input data (mapper.py):
#!/usr/bin/env python
import sys
import re

for line in sys.stdin:
    line = line.strip()
    portions = re.split(r'\t+', line)
    time = portions[1]
    # output key,value by print to stdout for reducer.py to read in.
Please note that the time in my data set is already in POSIX-time format.
How can I output key/value pairs in the mapper to achieve that?
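One possible direction, given only as a sketch: since the time is already POSIX time, the mapper could use the hour bucket `time // 3600` as the key and the whole row as the value. The names `WINDOW`, `map_line` and `overlap` are assumptions made for this example, and the sketch assumes fields are separated by single tabs.

```python
#!/usr/bin/env python
import sys

WINDOW = 3600  # one hour in seconds (assumed bucket size)


def map_line(line, overlap=False):
    """Return 'bucket<TAB>row' strings for one input row."""
    row = line.rstrip("\n")
    fields = row.split("\t")
    if len(fields) < 5:
        return []  # skip rows that do not match the five-field format
    try:
        bucket = int(float(fields[1])) // WINDOW
    except ValueError:
        return []  # skip rows whose time field is not numeric
    # With overlap=True the row is also sent to the following bucket.
    keys = [bucket, bucket + 1] if overlap else [bucket]
    return ["%d\t%s" % (k, row) for k in keys]


if __name__ == "__main__":
    for line in sys.stdin:
        for out in map_line(line):
            print(out)
```

A fixed hourly bucket does not capture every pair of rows within 1 hour of each other: two rows a few minutes apart can fall on either side of a bucket boundary. Passing `overlap=True` emits each row under its own bucket and the next one, so any two rows less than an hour apart meet in at least one reducer. The reducer would then still need to compare timestamps and handle pairs that it sees twice.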
I'm still very new to MapReduce/Hadoop and appreciate all the help. Thank you in advance!