I'm writing a Java application on Hadoop 1.1.1 (Ubuntu) that compares strings to find their longest common substrings. Both the map and reduce phases run successfully on small data sets, but whenever I increase the size of the input, the reduce output never appears in my target output directory. No error is reported at all, which makes this all the weirder. I'm running everything from Eclipse with 1 mapper and 1 reducer.
My reducer finds the longest common substring in a collection of strings, then emits the substring as the key and the index of each string that contains it as the value. Here is a short example.
Input Data
0: ALPHAA
1: ALPHAB
2: ALZHA
Output Emitted
Key: ALPHA Value: 0
Key: ALPHA Value: 1
Key: AL Value: 0
Key: AL Value: 1
Key: AL Value: 2
The first two input strings share "ALPHA" as their longest common substring, while all three share "AL". I then index the substrings and write them to a database when the process is complete.
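To make the example above concrete, here is a minimal sketch of that logic in plain Java, with no Hadoop dependencies. It is not the actual reducer code: the class and method names (`LcsSketch`, `longestCommonSubstring`, `emit`) are hypothetical, and the brute-force search over the shortest string is an assumption chosen for readability, not for performance on large inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reducer's core logic (illustration only).
class LcsSketch {

    // Returns the longest substring shared by every input string, or "" if none.
    static String longestCommonSubstring(List<String> strings) {
        if (strings.isEmpty()) {
            return "";
        }
        // Any common substring must fit inside the shortest input.
        String shortest = strings.get(0);
        for (String s : strings) {
            if (s.length() < shortest.length()) {
                shortest = s;
            }
        }
        // Try candidates from longest to shortest; the first match wins.
        for (int len = shortest.length(); len > 0; len--) {
            for (int start = 0; start + len <= shortest.length(); start++) {
                String candidate = shortest.substring(start, start + len);
                if (containedInAll(candidate, strings)) {
                    return candidate;
                }
            }
        }
        return "";
    }

    // True if every string contains the candidate substring.
    static boolean containedInAll(String candidate, List<String> strings) {
        for (String s : strings) {
            if (!s.contains(candidate)) {
                return false;
            }
        }
        return true;
    }

    // Builds one "Key: <substring> Value: <index>" line per selected input string.
    static List<String> emit(List<String> data, int... indices) {
        List<String> group = new ArrayList<String>();
        for (int i : indices) {
            group.add(data.get(i));
        }
        String key = longestCommonSubstring(group);
        List<String> out = new ArrayList<String>();
        for (int i : indices) {
            out.add("Key: " + key + " Value: " + i);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("ALPHAA", "ALPHAB", "ALZHA");
        for (String line : emit(data, 0, 1)) {
            System.out.println(line);
        }
        for (String line : emit(data, 0, 1, 2)) {
            System.out.println(line);
        }
    }
}
```

Running `main` on the three sample strings prints the same five key/value lines listed under "Output Emitted". How the strings are grouped before the substring search (here, the explicit index lists passed to `emit`) is an assumption of the sketch.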
One additional observation: I can see that intermediate files are created in my output directory, but the reduced data is never written to an output file.
I've pasted the Hadoop output log below. It reports a number of output records from the reducer (Reduce output records=832559), but they seem to disappear. Any suggestions are appreciated.
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Total input paths to process : 1
Running job: job_local_0001
setsid exited with exit code 0
Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@411fd5a3
Snappy native library not loaded
io.sort.mb = 100
data buffer = 79691776/99614720
record buffer = 262144/327680
map 0% reduce 0%
Spilling map output: record full = true
bufstart = 0; bufend = 22852573; bufvoid = 99614720
kvstart = 0; kvend = 262144; length = 327680
Finished spill 0
Starting flush of map output
Finished spill 1
Merging 2 sorted segments
Down to the last merge-pass, with 2 segments left of total size: 28981648 bytes
Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
Task attempt_local_0001_m_000000_0 done.
Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3aff2f16
Merging 1 sorted segments
Down to the last merge-pass, with 1 segments left of total size: 28981646 bytes
map 100% reduce 0%
reduce > reduce
map 100% reduce 66%
reduce > reduce
map 100% reduce 67%
reduce > reduce
reduce > reduce
map 100% reduce 68%
reduce > reduce
reduce > reduce
reduce > reduce
map 100% reduce 69%
reduce > reduce
reduce > reduce
map 100% reduce 70%
reduce > reduce
job_local_0001
Job complete: job_local_0001
Counters: 22
File Output Format Counters
Bytes Written=14754916
FileSystemCounters
FILE_BYTES_READ=61475617
HDFS_BYTES_READ=97361881
FILE_BYTES_WRITTEN=116018418
HDFS_BYTES_WRITTEN=116746326
File Input Format Counters
Bytes Read=46366176
Map-Reduce Framework
Reduce input groups=27774
Map output materialized bytes=28981650
Combine output records=0
Map input records=4629524
Reduce shuffle bytes=0
Physical memory (bytes) snapshot=0
Reduce output records=832559
Spilled Records=651304
Map output bytes=28289481
CPU time spent (ms)=0
Total committed heap usage (bytes)=2578972672
Virtual memory (bytes) snapshot=0
Combine input records=0
Map output records=325652
SPLIT_RAW_BYTES=136
Reduce input records=27774
reduce > reduce
reduce > reduce