Since I need to read a bunch of files into the mapper, in a non-Hadoop
environment I use os.walk(dir) and file = open(path, mode) to read
each file.
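
In the non-Hadoop version, the reading logic is roughly this (a simplified sketch; 'path of non-hdfs' is just a placeholder for my real local directory and the parsing is omitted):

    import os

    def read_all_files(top_dir):
        """Walk a local directory tree and yield every line of every file."""
        for root, dirs, files in os.walk(top_dir):
            for name in files:
                path = os.path.join(root, name)
                with open(path) as f:
                    for line in f:
                        yield line.rstrip('\n')

    for line in read_all_files('path of non-hdfs'):
        pass  # parse each line here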
However, in a Hadoop environment, I have read that Hadoop Streaming converts the file input into the mapper's stdin and the reducer's stdout into the file output, so I have a few questions about how to read input files:
Do we have to read input from STDIN in mapper.py and let Hadoop Streaming feed the files in the HDFS input directory to STDIN?
If I want to read each file separately and parse its lines, how can I read input from a file in mapper.py?
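
For reference, this is what I understand a stdin-based streaming mapper to look like (a minimal sketch; the token counting is only placeholder parsing):

    #!/usr/bin/env python
    import sys

    # My understanding: Hadoop Streaming feeds the lines of all input files
    # to the mapper on stdin, and the mapper writes key<TAB>value to stdout.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        # placeholder parsing: emit each whitespace-separated token with count 1
        for token in line.split():
            sys.stdout.write('%s\t%d\n' % (token, 1))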
My previous Python code for the non-Hadoop environment uses: for root, dirs, files in os.walk('path of non-hdfs') ...
However, in the Hadoop environment I need to change 'path of non-hdfs' to
the HDFS path I copyFromLocal to, and I have tried many paths with no
success, such as os.walk('/user/hadoop/in') (the path I see when running
bin/hadoop dfs -ls), os.walk('home/hadoop/files') (my local path in the
non-Hadoop environment), and even
os.walk('hdfs://host:fs_port/user/hadoop/in').
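
Concretely, the attempts looked roughly like this (none of them found any files when run inside the streaming job):

    import os

    # The paths I tried; in every case os.walk produced nothing.
    candidates = [
        '/user/hadoop/in',                      # path shown by bin/hadoop dfs -ls
        'home/hadoop/files',                    # my local path in the non-Hadoop environment
        'hdfs://host:fs_port/user/hadoop/in',   # full HDFS URI
    ]
    for candidate in candidates:
        for root, dirs, files in os.walk(candidate):
            for name in files:
                print(os.path.join(root, name))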
Can anyone tell me whether I can read input from files using file operations in mapper.py, or whether I have to read input from STDIN?
Thanks.