
Turns out that copying a large file (~6 GB) from S3 to every node in an Elastic MapReduce cluster in a bootstrap action doesn't scale well; the pipe is only so big, and downloads to the nodes get throttled as the number of nodes grows.
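For concreteness, here is a minimal sketch of the kind of per-node bootstrap action I mean; the bucket, key, and local path are placeholders, and boto3 just stands in for whatever S3 client actually runs on the nodes:

```python
# Hypothetical per-node bootstrap action: every node pulls the ~6 GB file
# from S3 onto local disk. With many nodes doing this at once, the
# aggregate download from S3 becomes the bottleneck described above.
import boto3  # assumed available on the node; any S3 client would do

s3 = boto3.client("s3")
s3.download_file("my-bucket", "data/big-file.bin", "/mnt/big-file.bin")
```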

I'm running a job flow with 22 steps, and this file is needed by maybe 8 of them. Sure, I can copy from S3 to HDFS and cache the file before every step, but that's a major speed hit (and can hurt scalability). Ideally, the job flow would start with the file already on every node.
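The per-step workaround would look roughly like the following (s3-dist-cp is EMR's S3/HDFS copy tool; on older job flows it runs as a jar step, and the paths here are placeholders). Repeating it before each of the ~8 steps is what makes it so costly:

```python
# Hypothetical per-step staging: copy the file from S3 into HDFS so the
# step can pick it up (e.g. via the distributed cache). Doing this before
# every step that needs the file repeats a multi-GB transfer each time.
import subprocess

subprocess.check_call([
    "s3-dist-cp",
    "--src", "s3://my-bucket/data/",   # placeholder S3 source prefix
    "--dest", "hdfs:///cache/",        # placeholder HDFS target
])
```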

There are Stack Overflow questions that at least obliquely address persisting a cached file through a job flow: "Re-use files in Hadoop Distributed cache" and "Life of distributed cache in Hadoop".

I don't think they help me. Anyone have some fresh ideas?

That depends on what you want to do with the file; in many cases (including MR jobs, Hive queries, etc.) EMR can use the file directly on S3, without downloading it to the local nodes. Would that be useful in your context? – Julio Faerman
The file has to be on every node; it's non-negotiable. It's needed by a particular executable. – verve

1 Answer


Two ideas; consider your case specifics and disregard at will:

  • Share the file over NFS from a server with an instance type that has good enough networking, in the same placement group or AZ.
  • Use EBS PIOPS volumes and EBS-Optimized instances with the file pre-loaded, and attach them to your nodes in a bootstrap action (see the sketch below).
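
A rough sketch of the second idea, assuming the file lives on a pre-loaded snapshot (a volume can only attach to one instance, so each node creates its own volume from that snapshot); the snapshot ID, device name, and IOPS value are placeholders, and boto3 stands in for whatever EC2 client you prefer:

```python
# Hypothetical bootstrap-time script: each node creates a PIOPS volume
# from a snapshot that already contains the file, then attaches the
# volume to itself. No S3 transfer happens at cluster start.
import boto3
import urllib.request

def metadata(path):
    # EC2 instance metadata service
    return urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/" + path).read().decode()

instance_id = metadata("instance-id")
az = metadata("placement/availability-zone")

ec2 = boto3.client("ec2", region_name=az[:-1])

# Create the volume in this node's AZ from the pre-loaded snapshot.
vol = ec2.create_volume(SnapshotId="snap-0123456789abcdef0",  # placeholder
                        AvailabilityZone=az,
                        VolumeType="io1", Iops=4000)
vol_id = vol["VolumeId"]
ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])

# Attach it to this node; the file is available once the device is mounted,
# which is a local operation and never touches S3.
ec2.attach_volume(VolumeId=vol_id, InstanceId=instance_id, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[vol_id])
```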