
When creating a new EMR job with an S3 bucket as the input source, is the data automatically copied from S3 into HDFS on the nodes? Or does the data remain solely in S3, to be read when needed by MapReduce jobs?

I get the impression it's the latter; but if the data is stored in S3 and the processing is done on provisioned EC2 instances, doesn't this go against a fundamental principle of MapReduce: doing the processing local to the data, as opposed to a more traditional system that moves the data to where the processing is?

What are the relative implications of this approach given a reasonably large data set, such as 1 PB? For example, does the cluster take longer to start?


1 Answer


You have both options. You can stream data directly from Amazon S3, or copy it into HDFS first and then process it locally. The first approach is fine if you only intend to read the data once; if you plan to query the same input data multiple times, you'd probably want to copy it to HDFS first.
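For illustration, here is a rough sketch of what the two options look like in a MapReduce driver (Hadoop's newer mapreduce API); the bucket name and paths are just placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class InputSourceExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "input-source-example");

            // Option 1: read the input straight from S3; nothing is copied to HDFS,
            // and each map task pulls its split over the network from S3.
            FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input/"));

            // Option 2: copy the data into HDFS first, e.g.
            //   hadoop distcp s3n://my-bucket/input/ hdfs:///data/input/
            // and then point the job at the HDFS copy instead:
            // FileInputFormat.addInputPath(job, new Path("hdfs:///data/input/"));

            FileOutputFormat.setOutputPath(job, new Path("hdfs:///data/output/"));
            // ... set the mapper/reducer classes etc., then submit:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Option 2 costs you an up-front copy of the whole data set, but every job after that reads it locally from HDFS.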

And yes, by using S3 as the input to MapReduce you lose the data-locality optimization. Also, if you plan to use S3 as a replacement for HDFS, I would recommend the S3 Block FileSystem rather than the S3 Native FileSystem, since the native filesystem imposes a 5 GB limit on file size.
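The two filesystems are selected by URI scheme. A minimal sketch (bucket names are placeholders, and the AWS credentials are assumed to already be set in the configuration):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class S3SchemeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // S3 Block FileSystem ("s3://"): files are stored as blocks, much like HDFS,
            // so the 5 GB per-file limit does not apply; the bucket contents are only
            // readable through Hadoop, not by ordinary S3 tools.
            FileSystem blockFs = FileSystem.get(URI.create("s3://my-block-bucket/"), conf);

            // S3 Native FileSystem ("s3n://"): objects map one-to-one to files written
            // by other tools, but each file is capped at 5 GB.
            FileSystem nativeFs = FileSystem.get(URI.create("s3n://my-native-bucket/"), conf);

            System.out.println(blockFs.getUri() + " vs " + nativeFs.getUri());
        }
    }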

HTH