4 votes

To work around performance issues with Amazon EMR, I'm trying to use s3distcp to copy files from S3 to my EMR cluster for local processing. As a first test, I'm copying one day's data, 2160 files from a single directory, using the --groupBy option to collapse them into one (or a few) files.

The job seems to run just fine, with the map/reduce progress reaching 100%, but at that point the process hangs and never comes back. How can I figure out what's going on?

The source files are gzipped text files stored in S3, each about 30 KB. This is a vanilla Amazon EMR cluster, and I'm running s3distcp from the shell of the master node.

hadoop@ip-xxx:~$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3n://xxx/click/20140520 --dest hdfs:////data/click/20140520 --groupBy ".*(20140520).*" --outputCodec lzo
14/05/21 20:06:32 INFO s3distcp.S3DistCp: Running with args: [Ljava.lang.String;@26f3bbad
14/05/21 20:06:35 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/9f423c59-ec3a-465e-8632-ae449d45411a/output'
14/05/21 20:06:35 INFO s3distcp.S3DistCp: GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: us-west-2b
14/05/21 20:06:35 INFO s3distcp.S3DistCp: Created AmazonS3Client with conf KeyId AKIAJ5KT6QSV666K6KHA
14/05/21 20:06:37 INFO s3distcp.FileInfoListing: Opening new file: hdfs:/tmp/9f423c59-ec3a-465e-8632-ae449d45411a/files/1
14/05/21 20:06:38 INFO s3distcp.S3DistCp: Created 1 files to copy 2160 files
14/05/21 20:06:38 INFO mapred.JobClient: Default number of map tasks: null
14/05/21 20:06:38 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 72
14/05/21 20:06:38 INFO mapred.JobClient: Default number of reduce tasks: 3
14/05/21 20:06:39 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/05/21 20:06:39 INFO mapred.JobClient: Setting group to hadoop
14/05/21 20:06:39 INFO mapred.FileInputFormat: Total input paths to process : 1
14/05/21 20:06:39 INFO mapred.JobClient: Running job: job_201405211343_0031
14/05/21 20:06:40 INFO mapred.JobClient:  map 0% reduce 0%
14/05/21 20:06:53 INFO mapred.JobClient:  map 1% reduce 0%
14/05/21 20:06:56 INFO mapred.JobClient:  map 4% reduce 0%
14/05/21 20:06:59 INFO mapred.JobClient:  map 36% reduce 0%
14/05/21 20:07:00 INFO mapred.JobClient:  map 44% reduce 0%
14/05/21 20:07:02 INFO mapred.JobClient:  map 54% reduce 0%
14/05/21 20:07:05 INFO mapred.JobClient:  map 86% reduce 0%
14/05/21 20:07:06 INFO mapred.JobClient:  map 94% reduce 0%
14/05/21 20:07:08 INFO mapred.JobClient:  map 100% reduce 10%
14/05/21 20:07:11 INFO mapred.JobClient:  map 100% reduce 19%
14/05/21 20:07:14 INFO mapred.JobClient:  map 100% reduce 27%
14/05/21 20:07:17 INFO mapred.JobClient:  map 100% reduce 29%
14/05/21 20:07:20 INFO mapred.JobClient:  map 100% reduce 100%
[hangs here]

The job shows as:

hadoop@xxx:~$ hadoop job -list
1 job currently running
JobId   State   StartTime       UserName        Priority        SchedulingInfo
job_201405211343_0031   1       1400702799339   hadoop  NORMAL  NA

and there's nothing in the destination HDFS directory:

hadoop@xxx:~$ hadoop dfs -ls /data/click/
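
Is there a way to see what the stuck reduce task is actually doing? The best I can come up with is checking the job status and pulling the logs for the running reduce attempt, something like this (the userlogs path is just my assumption for this EMR AMI):

hadoop job -status job_201405211343_0031
hadoop job -list-attempt-ids job_201405211343_0031 reduce running

and then looking under /mnt/var/log/hadoop/userlogs/<attempt_id>/ on whichever node is running that attempt.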

Any ideas?

2
Are you sure it never comes back, or does it just finish the first bucket quickly and then take forever for the rest? That's what I've noticed. – gae123

2 Answers

0 votes

hadoop@ip-xxx:~$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3n://xxx/click/20140520/ --dest hdfs:////data/click/20140520/ --groupBy ".*(20140520).*" --outputCodec lzo

I faced a similar problem. All I needed to do was add an extra slash at the end of the directory paths (as shown above). With that change the job completed and printed its stats; before, it hung at 100%.
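
To double-check that the grouped output actually landed, listing the destination (path as in the question) should show one or a few .lzo files instead of the 2160 small inputs:

hadoop fs -ls /data/click/20140520/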

0 votes

Use s3:// instead of s3n://:

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3://xxx/click/20140520 --dest hdfs:////data/click/20140520 --groupBy ".*(20140520).*" --outputCodec lzo