3 votes

I am using Amazon EMR and I'm able to create and run jobflows using the CLI tool; jobs run fine. However, I'm running into a problem when trying to load data into my EMR cluster's HDFS from both S3 and the name node's local filesystem.

I would like to populate HDFS from S3, and I'm trying to use the S3DistCp tool to do this. I'm running this command:

elastic-mapreduce --jobflow $JOBFLOWID --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.0.1/s3distcp.jar --arg --src --arg 's3n://my-bucket/src' --arg --dest --arg 'hdfs:///my-emr-hdfs/dest/'

I'm getting two errors, probably related, in the logs. In the MapReduce job output the job reaches 100% but fails at the end:

INFO org.apache.hadoop.mapred.JobClient (main):     Map output records=184
ERROR com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): 21 files failed to copy

In the NameNode daemon logs I'm getting this exception:

INFO org.apache.hadoop.ipc.Server (IPC Server handler 13 on 9000): IPC Server handler 13 on 9000, call addBlock(/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info, DFSClient_-1580223521, null) from xx.xx.xx.xx:xxxxx: error: java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1531)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

I set dfs.replication=1 when creating the jobflow. My nodes are c1.mediums, and the data I'm trying to push into HDFS is under 3 GB, so it shouldn't be an out-of-disk issue. But maybe I'm missing something.
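For reference, this is roughly how I'm setting the replication factor at cluster creation, via the configure-hadoop bootstrap action (the bootstrap-action path and the --hdfs-key-value flag are from memory, so treat this as an approximation rather than the exact command I ran):

elastic-mapreduce --create --alive --name "my-cluster" --num-instances 10 --instance-type c1.medium --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--hdfs-key-value,dfs.replication=1"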

Two questions: 1) Any insight into why S3DistCp is failing? 2) The second question is somewhat unrelated: is it possible to create a jobflow where the very first step is an S3DistCp job that initializes the cluster with data?
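To make the second question concrete, something along these lines is what I have in mind, i.e. creating the flow with the S3DistCp copy as its first step (the jobflow name is just a placeholder and the --create/--alive flags are the standard jobflow-creation options; the S3DistCp arguments are the same as above):

elastic-mapreduce --create --alive --name "init-hdfs-from-s3" --num-instances 10 --instance-type c1.medium --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.0.1/s3distcp.jar --arg --src --arg 's3n://my-bucket/src' --arg --dest --arg 'hdfs:///my-emr-hdfs/dest/'

with the remaining job steps added to the same flow afterwards.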

Any insight appreciated. Thanks.

Update: My test below in the comments didn't seem to work. Here's some more info from the logs.

WARN org.apache.hadoop.hdfs.DFSClient (Thread-15): Error Recovery for block null bad datanode[0] nodes == null
WARN org.apache.hadoop.hdfs.DFSClient (Thread-15): Could not get block locations. Source file "/mnt/tmp/mapred/system/jobtracker.info" - Aborting...
WARN org.apache.hadoop.mapred.JobTracker (main): Writing to file hdfs://xx.xx.xx.xx:xxxx/mnt/tmp/mapred/system/jobtracker.info failed!
WARN org.apache.hadoop.mapred.JobTracker (main): FileSystem is not ready yet!
WARN org.apache.hadoop.mapred.JobTracker (main): Failed to initialize recovery manager. 
EMR boots up with a deprecated hadoop-site.xml file, and it looks like it may be using this file and the hadoop.tmp.dir config defined in it, which points to /tmp. /tmp is much smaller, so I'm going to try overriding that config, though it may not work: forums.aws.amazon.com/thread.jspa?threadID=32108 – Girish Rao
NameNode daemon logs on EMR almost always start up with those errors; I submit that those are red herrings. You can do anything you want in an EMR step, as long as you have a main method in a jar for Hadoop to call. But what is wrong with using s3n:// (the native S3 filesystem) for map inputs directly? – Judge Mental
Thanks for this insight @JudgeMental. It does seem that all my data files get into EMR HDFS okay, but the s3distcp job goes into a FAILED state every time, so it's off-putting to see this. – Girish Rao
@JudgeMental Regarding your s3n question: my jobflow has many jobs, around 20, some of which run multiple times per day. So I was leaning towards keeping a set of 10 CORE instances up 24/7 to avoid transferring gigs of data back and forth with s3n, which takes more time. The above errors made me want to minimize S3 transfers. But maybe the speed of s3distcp and the reliability of S3 as storage outweigh the transfer costs? – Girish Rao
I was getting the same error in a later job step and realized the storage on the data nodes is not what it's documented to be, so it's looking more like an actual storage-space problem. I've posted a question here: stackoverflow.com/questions/10856190/… – Girish Rao

1 Answer

3 votes

For the first question, "jobtracker.info could only be replicated to 0 nodes, instead of 1", I hope this helps: http://wiki.apache.org/hadoop/FAQ#What_does_.22file_could_only_be_replicated_to_0_nodes.2C_instead_of_1.22_mean.3F Copying from that link:

3.13. What does "file could only be replicated to 0 nodes, instead of 1" mean?

The NameNode does not have any available DataNodes. This can be caused by a wide variety of reasons. Check the DataNode logs, the NameNode logs, network connectivity, ... Please see the page: CouldOnlyBeReplicatedTo
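For example, a quick way to confirm whether any DataNodes have actually registered with the NameNode is to SSH to the master node and run the stock HDFS tools (both have been part of Hadoop for a long time):

hadoop dfsadmin -report
hadoop fsck /

If dfsadmin -report shows zero live DataNodes, the "replicated to 0 nodes" error above is exactly what you'd expect.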

I was facing a similar issue while trying to deploy a single-node cluster when there was a delay in starting up the DataNode daemon.
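In that case simply waiting for the DataNode daemon to finish starting was enough. A rough sketch of the kind of check I mean, run on the master before writing anything to HDFS (it assumes dfsadmin -report prints a "Datanodes available: N" line, which may be worded differently in other Hadoop versions):

# wait until at least one live DataNode has registered with the NameNode
while ! hadoop dfsadmin -report 2>/dev/null | grep -q 'Datanodes available: [1-9]'; do
  echo "no live DataNodes yet, waiting..."
  sleep 10
done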