It turns out that copying a large file (~6 GB) from S3 to every node in an Elastic MapReduce cluster via a bootstrap action doesn't scale well; the pipe is only so big, and the downloads to the nodes get throttled as the number of nodes grows.
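For context, the per-node download my bootstrap action does boils down to something like this (bucket, key, and local path are placeholders; this uses the AWS SDK for Java rather than my actual script):

```java
import java.io.File;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class FetchReferenceFile {
    public static void main(String[] args) {
        // Credentials come from the instance role / environment.
        AmazonS3Client s3 = new AmazonS3Client();
        // Every node in the cluster issues the same ~6 GB GET at startup,
        // which is what stops scaling once the node count gets large.
        s3.getObject(new GetObjectRequest("my-bucket", "reference/bigfile.dat"),
                     new File("/mnt/bigfile.dat"));
    }
}
```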
I'm running a job flow with 22 steps, and this file is needed by maybe 8 of them. Sure, I can copy it from S3 to HDFS and cache the file before every step that needs it, but that's a major speed killer (and can hurt scalability too). Ideally, the job flow would start with the file already on every node.
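The per-step workaround I'm describing looks roughly like this, using the classic Hadoop API (paths and names are placeholders, a sketch rather than my exact code):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CachePerStep {
    // Called in the driver of each step that needs the reference file.
    public static void configureStep(Configuration conf) throws Exception {
        Path s3Src = new Path("s3n://my-bucket/reference/bigfile.dat");
        Path hdfsDst = new Path("hdfs:///cache/bigfile.dat");
        FileSystem s3fs = s3Src.getFileSystem(conf);
        FileSystem hdfs = hdfsDst.getFileSystem(conf);
        // Repeating this ~6 GB copy ahead of every step is the speed killer.
        if (!hdfs.exists(hdfsDst)) {
            FileUtil.copy(s3fs, s3Src, hdfs, hdfsDst, false, conf);
        }
        // Each step's tasks then localize the file from HDFS via the distributed cache.
        DistributedCache.addCacheFile(new URI("hdfs:///cache/bigfile.dat#bigfile.dat"), conf);
    }
}
```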
There are Stack Overflow questions that at least obliquely address persisting a cached file across a job flow: "Re-use files in Hadoop Distributed cache" and "Life of distributed cache in Hadoop".
I don't think either of them helps me here. Anyone have some fresh ideas?