2 votes

Loading many small files (more than 200,000 files of roughly 4 KB each) from an S3 bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to fetch the data, though I cannot figure out exactly where the bottleneck is.

Pig Code Sample

data = LOAD 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);

Hive Code Sample

CREATE EXTERNAL TABLE data (value STRING) LOCATION  's3://data-bucket/';

Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?

I tried the following, without any noticeable effect (the Hive settings were applied roughly as sketched after this list):

  • Increasing the number of task nodes
  • Setting hive.optimize.s3.query=true
  • Manually setting the number of mappers
  • Increasing the instance type from medium up to xlarge
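
For reference, the Hive-side settings were applied roughly like this (a sketch only; the values shown are illustrative, and mapred.map.tasks is just a hint to Hadoop, not a hard limit):

-- attempted Hive session settings (illustrative values)
SET hive.optimize.s3.query=true;
SET mapred.map.tasks=50;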

I know that s3distcp would speed up the process, but I could only get better performance after a lot of tweaking (including setting the number of worker threads), and I would prefer to change parameters directly in my Pig/Hive scripts.
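
For reference, the kind of s3distcp invocation I mean looks roughly like this (the paths, group-by pattern, and target size are placeholders, not my actual values):

s3-dist-cp --src s3://data-bucket/ \
           --dest hdfs:///data/ \
           --groupBy '.*(part).*' \
           --targetSize 128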


1 Answer

2 votes

You can either:

  1. use distcp to merge the files before your job starts: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/

  2. have a Pig script that will do it for you, once (see the sketch after this list).
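
A minimal sketch of such a one-off compaction script, assuming the input bucket from the question and a placeholder HDFS output path:

-- combine the small S3 files into ~128 MB splits and rewrite them as fewer, larger files
SET pig.noSplitCombination false;
SET pig.maxCombinedSplitSize 134217728;
raw = LOAD 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);
STORE raw INTO 'hdfs:///merged-data' USING PigStorage(',');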

If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:

-- When false, Pig combines small input files into larger splits (fewer mappers);
-- set it to true to get one mapper per file.
SET pig.noSplitCombination false;
-- Choose the split size so that SUM(input sizes) / size = the wanted number of mappers.
SET pig.maxCombinedSplitSize 250000000;
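
As a rough worked example for the numbers in the question (about 200,000 files of ~4 KB each, so roughly 800 MB of input): with pig.maxCombinedSplitSize at 250000000 bytes you get about 800 MB / 250 MB ≈ 3-4 mappers, so lowering the value spawns more, for instance:

-- ~800 MB of input / 50 MB per combined split ≈ 16 mappers (illustrative values)
SET pig.noSplitCombination false;
SET pig.maxCombinedSplitSize 52428800;
data = LOAD 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);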

Please provide metrics for those cases.