4
votes

I ran a test AWS EMR job with a custom mapper but with NONE as the reducer. I got the (expected) output in 13 separate "part" files. How can I combine them into a single file?

I don't need to aggregate data in any special way, and I don't care if it is sorted, re-ordered arbitrarily, or left in order. But I would like to efficiently put the data back into a single file. Do I have to do that manually, or is there a way to do it as part of the EMR Cluster?

It's very strange to me that there isn't a default option or some sort of automatic step available for this. I've read a bit about the Identity Reducer. Does it do what I want, and if so, how do I use it when launching a cluster through the EMR console?

My data is in S3.


EDIT

To be very clear, I can run cat on all of the output parts after the job is done, if that's what I have to do. Locally, or on an EC2 instance, or whatever. Is that really what everyone does?
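
For concreteness, this is the kind of manual approach I have in mind (assuming the AWS CLI; the bucket name and paths are placeholders):

aws s3 cp s3://BUCKET/path/to/output/ ./output/ --recursive
cat ./output/part-* > combined
aws s3 cp combined s3://BUCKET/path/to/combined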

1
Have a look at this AWS forum thread. - Don Roby
@DonRoby Yes, I saw that. It's 5 years old and not very helpful. How do I supply -jobconf mapred.reduce.tasks=1 when launching jobs through the console? And how can I tell whether it's a bad idea or not? - jmilloy
I didn't make it an answer because I haven't had a chance to try it out myself. Probably the best way to know if it's a good idea or not is to try it with your actual situation. It sounds like the effect on performance varies widely depending on whether the code you already have actually reduces the volume of data. - Don Roby
@DonRoby Okay. Since it's a mapper with no reduce step, it doesn't reduce the volume of data at all. In fact, it increases the volume significantly by appending results to the input lines. I'm hoping someone can tell me whether running a big cat on my data manually is faster than anything Amazon EMR can do. - jmilloy

1 Answer

3
votes

If the mapper's output part files are themselves small, then you could try using hadoop fs -getmerge to merge them onto the local filesystem:

hadoop fs -getmerge s3n://BUCKET/path/to/output/ [LOCAL_FILE]

Then put the merged file back into S3:

hadoop fs -put [LOCAL_FILE] s3n://BUCKET/path/to/put/
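
If you want to avoid the intermediate local copy, piping hadoop fs -cat into hadoop fs -put - (reading from stdin) should also work, though I haven't benchmarked it; the paths are placeholders:

hadoop fs -cat s3n://BUCKET/path/to/output/part-* | hadoop fs -put - s3n://BUCKET/path/to/put/merged-output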

For the above commands to work, you need the following properties set in core-site.xml:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
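
If editing core-site.xml isn't convenient, the same properties can usually be supplied as generic -D options on the command line instead (a sketch; the property names match the XML above, and the key values and paths are placeholders):

hadoop fs -D fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY -getmerge s3n://BUCKET/path/to/output/ [LOCAL_FILE]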