
If I am running an EMR job (in Java) on Amazon Web Services to process a large amount of data, is it possible for every single mapper to access a small file stored on S3? Note that the small file I am talking about is NOT the input to the mappers. Rather, the mappers need to process the input according to rules in the small file. For example, the large input might be a billion lines of text, and I want to filter out words that appear in a blacklist by reading a small file of blacklisted words stored in an S3 bucket. Each mapper would process a different part of the input data, but they would all need to access the same blacklist file on S3. How can I make the mappers do this in Java?

EDIT: I am not using the Hadoop framework, so there are no setup() or map() method calls. I am simply using EMR streaming and reading the input from stdin line by line.
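For concreteness, here is roughly what my streaming mapper looks like today (a trimmed-down sketch; the class name and the local blacklist path are just placeholders). Right now the blacklist comes from a local file passed as the first argument, and what I am asking is how to load that same file from S3 inside every mapper instead:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class BlacklistMapper {
        public static void main(String[] args) throws Exception {
            // Currently the blacklist comes from a local file (args[0]).
            // What I want instead is to load it from something like
            // s3://mybucket/path/to/blacklist.txt inside every mapper.
            Set<String> blacklist = new HashSet<>(
                    Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));

            // Streaming contract: input records arrive on stdin, output goes to stdout.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    StringBuilder kept = new StringBuilder();
                    for (String token : line.split("\\s+")) {
                        if (!blacklist.contains(token)) {
                            if (kept.length() > 0) kept.append(' ');
                            kept.append(token);
                        }
                    }
                    System.out.println(kept);
                }
            }
        }
    }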


1 Answer


You can access any S3 object within a mapper using the S3 protocol directly, e.g. s3://mybucket/path/to/file.txt.

See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html

You can use S3 to access your mapper's input files as well as any ad hoc lookup file like the one you describe. Previously these were differentiated by protocol: s3n:// for S3 native objects and s3bfs:// for S3 block storage. Now you don't have to differentiate; just use s3://.
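Since you are reading stdin yourself rather than extending a Mapper class, the easiest way to use this is to open the S3 object at the top of your program, before you start consuming input. A minimal sketch, assuming the Hadoop client and EMRFS libraries end up on your mapper's classpath (they ship with EMR nodes) and using placeholder bucket/key names:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3LookupRead {
        public static void main(String[] args) throws Exception {
            // Placeholder path; on EMR the s3:// scheme is resolved by EMRFS.
            Path blacklistPath = new Path("s3://mybucket/path/to/blacklist.txt");
            FileSystem fs = blacklistPath.getFileSystem(new Configuration());

            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(blacklistPath), StandardCharsets.UTF_8))) {
                String word;
                while ((word = reader.readLine()) != null) {
                    // Load each blacklisted word into whatever structure your filter uses.
                    System.out.println(word);
                }
            }
        }
    }

If you would rather not depend on Hadoop classes at all, the AWS SDK for Java's AmazonS3 client (getObject) gives you the same object as a plain input stream over HTTPS.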

Alternatively, you can add an s3distcp step to the EMR cluster to copy the file and make it available in HDFS (not exactly what you asked about, but another option): http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
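For completeness, that step can also be added programmatically. A sketch using the AWS SDK for Java (v1); the cluster id, source prefix, and HDFS destination below are placeholders, and running s3-dist-cp through command-runner.jar assumes an EMR release 4.x or later (older AMI versions shipped the s3distcp jar at a different location):

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.ActionOnFailure;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class AddS3DistCpStep {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // s3-dist-cp copies the lookup file from S3 into HDFS on the cluster.
            HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("s3-dist-cp",
                              "--src=s3://mybucket/path/to/",
                              "--dest=hdfs:///lookup/");

            StepConfig step = new StepConfig("copy-blacklist-to-hdfs", jarStep)
                    .withActionOnFailure(ActionOnFailure.CONTINUE);

            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")   // placeholder cluster id
                    .withSteps(step));
        }
    }

After the step finishes, every node in the cluster can read the file from the HDFS destination path instead of fetching it from S3.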