Opening files on HDFS from Hadoop mapreduce job

Question

Usually, I can open a new file with something like this:

aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
    aDict['positive'] = {line.strip() for line in f}

with open('WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

This will open up the two relevant text files in the WordLists folder and append each line to the dictionary as either positive or negative.

When I want to run a mapreduce job within Hadoop however, I don't think this works. I am running my program like so:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed

I have tried to change the code to this:

with open('/mapreduce/WordLists/negative_words.txt', 'r')

where mapreduce is a folder on the HDFS, with WordLists a subfolder containing negative words. But my program doesn't find this. Is what I'm doing possible and if so, what is the correct way to load files on the HDFS.

Edit

I've now tried:

with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r')

This seems to do something, but now I get this sort of output:

13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 50%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%

Then a job fail. So still not right. Any ideas?

Edit 2:

Having re-read the API, I notice I can use the -files option in the terminal to specify files. The API states:

The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.

In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.

-files hdfs://host:fs_port/user/testfile.txt

Therefore, I run:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed

From my understanding of the API, this creates symlinks so I can use "positive_words" and "negative_words" in my code, like this:

with open('negative_words.txt', 'r')

However, this still doesn't work. Any help anyone can offer would be hugely appreciated as I can't do much until I solve this.

Edit 3:

I can use this command:

-file ~/Twitter/SentimentWordLists/positive_words.txt

along with the rest of my command to run the Hadoop job. This finds the file on my local system rather than HDFS. This doesn't throw any errors, so it's accepted somewhere as a file. However, I've no idea how to access the file.

Try -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words,hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words — Alfonso Nishikawa
Tried that one too. I've deleted all code within the mapper class that touches those files to see if the actual import is even working - it isn't! That just brings up the same streaming command error — Andrew Martin

Alfonso Nishikawa Alfonso Nishikawa · Accepted Answer · 2013-08-28T09:12:58

Solution after plenty comments :)

Reading a data file in python: send it with -file and add to your script the following:

import sys

Sometimes is needed to add after the import:

sys.path.append('.')

(related to @DrDee comment in Hadoop Streaming - Unable to find file error)

Opening files on HDFS from Hadoop mapreduce job

2 Answers