
I am trying my hand at the inverted word list problem (for each word, the output is a list of the file names containing that word) in Hadoop Streaming. The input is the name of a directory containing text files. I have written the mapper and reducer in Python, and they work fine when tested with Unix piping. However, when run with the Hadoop Streaming command, the code executes but the job fails in the end. I suspect the problem is somewhere in the mapper code but can't pinpoint exactly what it is.

I am a beginner (so please excuse me if I didn't get anything right), using the Cloudera training VM on VMware Fusion. My mapper and reducer .py executables are placed in the home directory on the local system as well as on HDFS. I have the directory "shakespeare" on HDFS. The Unix pipe command below works fine:

echo shakespeare | ./InvertedMapper.py | sort | ./InvertedReducer.py

However, the Hadoop Streaming command doesn't:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar -input shakespeare -output InvertedList -mapper InvertedMapper.py -reducer InvertedReducer.py -file InvertedMapper.py -file InvertedReducer.py

#MAPPER CODE

#!/usr/bin/env python

import sys
import os

class Mapper(object):

        def __init__(self, stream, sep='\t'):
                self.stream=stream
                self.sep=sep

        def __iter__(self):
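                # Read stdin once, treat it as a directory name, and
                # yield the absolute path of every file in it.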
                os.chdir(self.stream.read().strip())
                files = [os.path.abspath(f) for f in os.listdir(".")]
                for file in files:
                        yield file

        def emit(self, key, value):
                sys.stdout.write("{0}{1}{2}\n".format(key,self.sep,value))

        def map(self):
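                # Emit one (word, filename) pair for every word in every file.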
                for file in self:
                        with open(file) as infile:
                                name = file.split("/")[-1].split(".")[0]
                                words = infile.read().strip().split()
                                for word in words:
                                        self.emit(word,name)

if __name__ == "__main__":
        cwd = os.getcwd()
        mapper = Mapper(sys.stdin)
        mapper.map()
        os.chdir(cwd)


#REDUCER CODE

#!/usr/bin/env python

import sys
from itertools import groupby
from operator import itemgetter

class Reducer(object):
        def __init__(self, stream, sep="\t"):
                self.stream = stream
                self.sep = sep

        def __iter__(self):
                for line in self.stream:
                        try:
                                parts = line.strip().split(self.sep)
                                yield parts[0], parts[1]
                        except IndexError:  # skip lines with no tab-separated value
                                continue

        def emit(self, key, value):
                sys.stdout.write("{0}{1}{2}\n".format(key, self.sep, value))

        def reduce(self):
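                # Input is sorted by Hadoop, so identical words arrive adjacent:
                # group by word and emit each word with its de-duplicated
                # list of file names.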
                for key, group in groupby(self, itemgetter(0)):
                        values = []
                        for item in group:
                                values.append(item[1])
                        values = set(values)
                        values = list(values)
                        self.emit(key, values)

if __name__ == "__main__":
        reducer = Reducer(sys.stdin)
        reducer.reduce()

The output from running the Hadoop command is below:

packageJobJar: [InvertedMapper1.py, /tmp/hadoop-training/hadoop-unjar281431668511629942/] [] /tmp/streamjob679048425003800890.jar tmpDir=null
19/02/17 00:22:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
19/02/17 00:22:19 INFO mapred.FileInputFormat: Total input paths to process : 5
19/02/17 00:22:20 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
19/02/17 00:22:20 INFO streaming.StreamJob: Running job: job_201902041621_0051
19/02/17 00:22:20 INFO streaming.StreamJob: To kill this job, run:
19/02/17 00:22:20 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201902041621_0051
19/02/17 00:22:20 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201902041621_0051
19/02/17 00:22:21 INFO streaming.StreamJob:  map 0%  reduce 0%
19/02/17 00:22:34 INFO streaming.StreamJob:  map 40%  reduce 0%
19/02/17 00:22:39 INFO streaming.StreamJob:  map 0%  reduce 0%
19/02/17 00:22:50 INFO streaming.StreamJob:  map 40%  reduce 0%
19/02/17 00:22:53 INFO streaming.StreamJob:  map 0%  reduce 0%
19/02/17 00:23:03 INFO streaming.StreamJob:  map 40%  reduce 0%
19/02/17 00:23:06 INFO streaming.StreamJob:  map 20%  reduce 0%
19/02/17 00:23:07 INFO streaming.StreamJob:  map 0%  reduce 0%
19/02/17 00:23:16 INFO streaming.StreamJob:  map 20%  reduce 0%
19/02/17 00:23:17 INFO streaming.StreamJob:  map 40%  reduce 0%
19/02/17 00:23:19 INFO streaming.StreamJob:  map 20%  reduce 0%
19/02/17 00:23:21 INFO streaming.StreamJob:  map 100%  reduce 100%
19/02/17 00:23:21 INFO streaming.StreamJob: To kill this job, run:
19/02/17 00:23:21 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201902041621_0051
19/02/17 00:23:21 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201902041621_0051
19/02/17 00:23:21 ERROR streaming.StreamJob: Job not successful. Error: NA
19/02/17 00:23:21 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
The command could be like this: mapred streaming -files /map_file.py,/reduce_file.py -mapper "python3 map_file.py" -reducer "python3 reduce_file.py" -input /input -output /output. Also, you made a mistake with self.emit(word, name); it should be self.emit(word, words). The problem is in the mapper for sure, and it's because of data we haven't seen, so I would add # -*- coding: utf-8 -*- to the top of your code. Hope it helps. – Sadegh

1 Answer


I don't know if this is the reason your code fails, but the FAQ indicates that one should not use Unix pipes in Hadoop Streaming:

https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html#Frequently_Asked_Questions
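
Beyond that, note that Streaming does not hand your mapper the directory name the way echo shakespeare | ./InvertedMapper.py does: Hadoop itself reads the input files and feeds their contents to the mapper on stdin, one line at a time. Your mapper's os.chdir() on a line of Shakespeare text will almost certainly raise an error and kill the map task, which matches the retries and final failure in your log. The file name has to come from somewhere other than stdin. Below is a minimal sketch of a Streaming-style mapper; it assumes the MRv1 behaviour (Hadoop 0.20, which your VM runs), where Streaming exports the JobConf property map.input.file into the task environment as map_input_file (newer releases use mapreduce_map_input_file instead). Treat it as a starting point, not a tested fix:

#!/usr/bin/env python
# Sketch of a Streaming-friendly mapper: reads file CONTENTS from stdin
# and recovers the file name from the environment, instead of treating
# stdin as a directory name.

import os
import sys

def main():
    # Streaming (MRv1) exposes map.input.file as map_input_file; newer
    # releases use mapreduce_map_input_file. "unknown" is only a fallback
    # for local testing, where neither variable is set.
    path = os.environ.get("map_input_file",
                          os.environ.get("mapreduce_map_input_file", "unknown"))
    name = path.split("/")[-1].split(".")[0]

    # Hadoop feeds the contents of the input split on stdin,
    # one line at a time.
    for line in sys.stdin:
        for word in line.strip().split():
            sys.stdout.write("{0}\t{1}\n".format(word, name))

if __name__ == "__main__":
    main()

You can simulate a task locally by supplying both stdin and the environment variable yourself, e.g. map_input_file=shakespeare/hamlet.txt ./InvertedMapper.py < shakespeare/hamlet.txt | sort | ./InvertedReducer.py. Your reducer should then work unchanged, since it already consumes tab-separated word/filename pairs from stdin.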