1 vote

I have an MR job running in EMR that currently stores its output in S3. The output of the reducer will be the input to the same mapper (think identity mapper), and I would like to execute the successive runs as fast as possible instead of waiting for EMR to write to S3 and then scheduling the mapper after 'x' minutes to read the data back. Writing to and reading from S3 takes a significant amount of time (~3-5 minutes), so I would like to know whether there is a way to avoid reading from S3 for my successive runs.

I also need to write the output of a MapReduce job to S3, because that data is important to me and needs to be persisted. However, for each successive MR run I do not want to read from S3; can I instead write the output to HDFS (or a cache) and use that as the input for the next run?

MultipleOutputs helps in writing data to multiple files in a folder or to multiple folders. See: Writing output to different folders hadoop

How can I extend this concept to write to two different endpoints - S3 & HDFS?


1 Answer

1 vote

Based on your question, let's assume that you want to read the input data from S3 in the first job, perform the computation with one or more intermediate MapReduce jobs that read from and write to HDFS, and have the last job write its output to S3.

You can read data from and write data to different endpoints (S3 or HDFS) depending on your implementation.

If you don't specify a scheme for the input/output paths of a MapReduce job, they default to HDFS. But you can also specify the URI scheme for the paths explicitly, using the prefix hdfs:// for the Hadoop Distributed File System and s3:// for Amazon S3 buckets. You can use s3n://, s3a:// or s3:// depending on your requirements; please refer to this link for more information on the S3 filesystems: Technically what is the difference between s3n, s3a and s3?
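
For example, a driver for one of the jobs could read its input from S3 and write its output to HDFS simply by using fully qualified paths. Below is a minimal sketch, not your exact job: the bucket name, HDFS path and job name are placeholders, no mapper/reducer is set so the identity classes are used (which matches the identity-mapper scenario in the question), and it assumes HDFS is the cluster's default filesystem, as it is on EMR.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3ToHdfsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "s3-to-hdfs");
        job.setJarByClass(S3ToHdfsJob.class);

        // Set your own mapper/reducer here; leaving them unset keeps the identity classes.
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);

        // Input is read from S3 because the path carries the s3:// scheme.
        FileInputFormat.addInputPath(job, new Path("s3://my-bucket/input/"));

        // Output goes to HDFS; a path without any scheme also defaults to HDFS.
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///intermediate/run1/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}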

Irrespective of the input/output endpoints of a MapReduce job, you can obtain a FileSystem object for an endpoint by passing a java.net.URI and an org.apache.hadoop.conf.Configuration object as arguments to FileSystem.get(). Please refer to the pseudo-code below:

// Pass the endpoint's URI (the bucket name below is a placeholder) and the job Configuration
FileSystem fileSystem = FileSystem.get(new URI("s3://bucket-name"), configuration);

Now, in your driver code, if the job uses only one endpoint you need to create just one FileSystem object; if you have two different endpoints, you can create two FileSystem objects, one for the input and one for the output. You can then use those objects to perform any filesystem operations you need, as sketched below.
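
Here is a rough sketch of that idea, again with a placeholder bucket and paths, and assuming HDFS is the cluster's default filesystem:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EndpointSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // One FileSystem object per endpoint.
        FileSystem s3Fs = FileSystem.get(new URI("s3://my-bucket"), conf);
        FileSystem hdfsFs = FileSystem.get(conf); // the cluster's default filesystem (HDFS on EMR)

        Path s3Input = new Path("s3://my-bucket/input/");
        Path hdfsOutput = new Path("/intermediate/run1/");

        // Example operations: make sure the S3 input exists and clear a stale HDFS output directory.
        if (!s3Fs.exists(s3Input)) {
            throw new IllegalStateException("Input path not found: " + s3Input);
        }
        if (hdfsFs.exists(hdfsOutput)) {
            hdfsFs.delete(hdfsOutput, true); // recursive delete so the next run can write here
        }
    }
}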

If you have multiple MapReduce jobs designed as above, you can simply invoke them in the order you want with commands of the form "hadoop jar jar_name main_class input_path output_path", or orchestrate them with an Oozie workflow.
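
Alternatively, if you prefer not to script the individual hadoop jar invocations, the same ordering can be expressed inside a single driver by waiting for each job to finish before submitting the next. This is only an illustrative sketch; the class name, job names and paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PipelineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // First job: read from S3, write the intermediate data to HDFS.
        Job first = Job.getInstance(conf, "s3-to-hdfs");
        first.setJarByClass(PipelineDriver.class);
        FileInputFormat.addInputPath(first, new Path("s3://my-bucket/input/"));
        FileOutputFormat.setOutputPath(first, new Path("/intermediate/pass1/"));
        if (!first.waitForCompletion(true)) {
            System.exit(1); // stop the pipeline if the first job fails
        }

        // Second job: read the intermediate data from HDFS, persist the result to S3.
        Job second = Job.getInstance(conf, "hdfs-to-s3");
        second.setJarByClass(PipelineDriver.class);
        FileInputFormat.addInputPath(second, new Path("/intermediate/pass1/"));
        FileOutputFormat.setOutputPath(second, new Path("s3://my-bucket/final/"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}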