Combine output files of MapReduce job

Question

I have written a Mapper and Reducer in Python and have run them successfully on Amazon's Elastic MapReduce (EMR) using Hadoop Streaming.

The final result folder contains the output in three different files: part-00000, part-00001 and part-00002. But I need the output as one single file. Is there a way I can do that?

Here is my code for the Mapper:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)

And here is my code for the Reducer:

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
max_count=0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        # skip lines whose count is not a number
        continue

    # Hadoop sorts the mapper output by key, so equal words arrive together
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            if current_word[0] != '@':
                print '%s\t%d' % (current_word, current_count)
                if count > max_count:
                    max_count = count
        current_count = count
        current_word = word

# write the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

I need the output of this as one single file.

Can't you just open the three files and concatenate them into a single output file? — James Mills
That is what I have been doing. But I would like it if I can get a single output file after the Reduce phase. — Arun Kumar
Can't you just do (Linux/UNIX): cat part-00000 part-00001 part-00002 > output? — James Mills
Thanks James. That's one way. But I can't get EMR itself to spit it out as one single part file? — Arun Kumar

James Mills · Accepted Answer · 2013-12-14T09:08:37

A really simple way of doing this (assuming a Linux/UNIX system):

$ cat part-00000 part-00001 part-00002 > output
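
If the part files have already been copied to a local directory, the same concatenation can be sketched in Python, which fits the rest of the job. This is only a minimal sketch: the function name merge_parts, its parameters, and the default "part-*" pattern are assumptions made for illustration, not part of Hadoop or EMR.

```python
import glob
import os
import shutil


def merge_parts(parts_dir, output_path, pattern="part-*"):
    """Concatenate every part file in parts_dir into output_path.

    Hypothetical helper: it assumes the part files are already on the
    local filesystem and that output_path does not itself match pattern.
    """
    # Sort by name so part-00000, part-00001, ... are appended in order.
    part_files = sorted(glob.glob(os.path.join(parts_dir, pattern)))
    with open(output_path, "wb") as merged:
        for part in part_files:
            with open(part, "rb") as src:
                # Stream the bytes across without loading a whole file into memory.
                shutil.copyfileobj(src, merged)
    return part_files
```

Calling merge_parts(".", "output") from the directory that holds the part files should produce the same file as the cat command above.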
