Optimal File Size of S3 Files for Hadoop Job on EMR?

Question

I am trying to determine the ideal size for a file stored in S3 that will be used in Hadoop jobs on EMR.

Currently I have large text files of around 5-10 GB each. I am worried about the delay in copying these large files to HDFS to run MapReduce jobs. I have the option of making these files smaller.

I know S3 files are copied in parallel to HDFS when using S3 as an input directory in MapReduce jobs. But will a single large file be copied to HDFS using a single thread, or will it be copied as multiple parts in parallel? Also, does Gzip compression affect whether a single file can be copied in multiple parts?

John Rotenstein · Accepted Answer · 2016-11-04T07:01:13

There are two factors to consider:

Compressed files cannot be split between tasks. For example, if you have a single, large, compressed input file, only one Mapper can read it.
Using more, smaller files makes parallel processing easier, but there is more overhead in starting a map task for each file. So, fewer, larger files mean less startup overhead.
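To make these two factors concrete, here is a rough sketch in Python that estimates how many map tasks a set of input files would produce. The 128 MiB split size and the `estimate_map_tasks` helper are assumptions for illustration only; the actual split size depends on your cluster configuration, and Hadoop's real split calculation differs in the details.

```python
import math

# Assumed split size, for illustration only; the real value depends on cluster configuration.
ASSUMED_SPLIT_BYTES = 128 * 1024 * 1024


def estimate_map_tasks(files, split_bytes=ASSUMED_SPLIT_BYTES):
    """Roughly estimate map tasks for a list of (name, size_in_bytes) pairs."""
    total = 0
    for name, size in files:
        if name.endswith(".gz"):
            # Gzip is not splittable: one mapper reads the whole file.
            total += 1
        else:
            # Splittable input: about one mapper per split, at least one per file.
            total += max(1, math.ceil(size / split_bytes))
    return total


ten_gib = 10 * 1024 ** 3
print(estimate_map_tasks([("input.txt.gz", ten_gib)]))  # a single gzip file
print(estimate_map_tasks([("input.txt", ten_gib)]))     # the same size, uncompressed
```

Under these assumptions, a single 10 GiB gzip file gets one mapper no matter how large the cluster is, while the same file uncompressed can be spread across many.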

Thus, there is a trade-off between the size and quantity of files. The recommended size is listed in a few places:

The Amazon EMR FAQ recommends:

If you are using GZIP, keep your file size to 1–2 GB because GZIP files cannot be split.

The Best Practices for Amazon EMR whitepaper recommends:

That means that a single mapper (a single thread) is responsible for fetching the data from Amazon S3. Since a single thread is limited to how much data it can pull from Amazon S3 at any given time (throughput), the process of reading the entire file from Amazon S3 into the mapper becomes the bottleneck in your data processing workflow. On the other hand, if your data files can be split, more than a single mapper can process your file. The suitable size for such data files is between 2 GB and 4 GB.

The main goal is to keep all of your nodes busy by processing as many files in parallel as possible, without introducing too much overhead.

Oh, and keep using compression. The savings in disk space and data transfer time make it more advantageous than enabling splitting.
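If you do decide to break up a 5-10 GB text file, one possible approach is to split it on line boundaries into several gzip parts before uploading to S3. The following is a minimal sketch, not a definitive implementation: `split_to_gzip_parts` and its parameters are hypothetical names, and the threshold counts uncompressed bytes, so you would need to pick it based on your own compression ratio to land in the recommended compressed size range.

```python
import gzip


def split_to_gzip_parts(src_path, out_prefix, max_uncompressed_bytes):
    """Split a text file on line boundaries into numbered .gz parts.

    Returns the list of part paths in order.
    """
    part_paths = []
    out = None
    written = 0  # uncompressed bytes written to the current part
    with open(src_path, "rb") as src:
        for line in src:
            # Open a new part if none exists yet, or if this line would
            # push a non-empty part over the threshold.
            if out is None or (written > 0 and written + len(line) > max_uncompressed_bytes):
                if out is not None:
                    out.close()
                path = f"{out_prefix}-{len(part_paths):05d}.gz"
                out = gzip.open(path, "wb")
                part_paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return part_paths
```

Because each part contains only whole lines, every part can be processed by its own mapper without any record being cut in half.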
