I ran a test AWS EMR job with a custom mapper but with NONE as the reducer. I got the (expected) output in 13 separate "part" files. How can I combine them into a single file?
I don't need to aggregate data in any special way, and I don't care if it is sorted, re-ordered arbitrarily, or left in order. But I would like to efficiently put the data back into a single file. Do I have to do that manually, or is there a way to do it as part of the EMR Cluster?
It's very strange to me that there isn't a default option or some sort of automatic step available for this. I've read a bit about the Identity Reducer. Does it do what I want, and if so, how do I use it when launching a cluster through the EMR console?
My data is in S3.
EDIT
To be very clear, I can run cat on all of the output parts after the job is done, if that's what I have to do. Locally, or on an EC2 instance, or whatever. Is that really what everyone does?
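For reference, the manual approach I mean would look roughly like the sketch below. It assumes the parts have already been copied from S3 into a local directory and that they are named with a `part-` prefix; the function name `concatenate_parts`, the directory, and the glob pattern are all hypothetical, so adjust them to whatever the job actually wrote. Since I don't care about order, sorting the file names is only there to make the result reproducible.

```python
import glob
import os
import shutil


def concatenate_parts(parts_dir, output_path, pattern="part-*"):
    """Concatenate every file in parts_dir matching pattern into output_path.

    Assumption: output_path lies outside parts_dir (or does not match
    pattern), so the combined file is never read back as one of its inputs.
    Returns the number of part files that were merged.
    """
    # Sorted only for a deterministic result; any order would do here.
    part_paths = sorted(glob.glob(os.path.join(parts_dir, pattern)))
    with open(output_path, "wb") as combined:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Stream in chunks so large parts are never held in memory.
                shutil.copyfileobj(part, combined)
    return len(part_paths)
```

Calling `concatenate_parts("output", "combined.txt")` on a directory holding the 13 downloaded parts would write them back to back into `combined.txt`, which is the same result as running cat over the files.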
-jobconf mapred.reduce.tasks=1 when launching jobs through the console? How can I tell if it's a bad idea or not? - jmilloy
cat on my data manually is faster than anything Amazon EMR can do. - jmilloy