I'm using Amazon EMR and I can create and run jobflows with the CLI tool; the jobs themselves run fine. However, I'm running into a problem when trying to load data into my EMR cluster's HDFS, both from S3 and from the name node's local filesystem.
I'd like to populate HDFS from S3, and I'm trying to use the S3DistCp tool to do this. I'm running this command:
elastic-mapreduce --jobflow $JOBFLOWID --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.0.1/s3distcp.jar --arg --src --arg 's3n://my-bucket/src' --arg --dest --arg 'hdfs:///my-emr-hdfs/dest/'
I'm getting two errors in the logs, probably related. In the MapReduce job output, the job reaches 100% but fails at the end:
INFO org.apache.hadoop.mapred.JobClient (main): Map output records=184
ERROR com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): 21 files failed to copy
In the name node daemon logs, I'm getting this exception:
INFO org.apache.hadoop.ipc.Server (IPC Server handler 13 on 9000): IPC Server handler 13 on 9000, call addBlock(/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info, DFSClient_-1580223521, null) from xx.xx.xx.xx:xxxxx: error: java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1531)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
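The "could only be replicated to 0 nodes" part makes me think the name node isn't seeing any live datanodes at all. I'm planning to sanity-check that from the master node with the standard Hadoop commands (nothing EMR-specific here):

hadoop dfsadmin -report   # should list live datanodes and their remaining capacity
hadoop fsck / -blocks     # block-level health check of HDFS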
I've set dfs.replication=1 when creating the jobflow. My nodes are c1.mediums, and the data I'm trying to push into HDFS is under 3 GB, so it shouldn't be a disk-space issue, but maybe I'm missing something.
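In case the way I'm setting that matters: the replication setting goes in through the configure-hadoop bootstrap action at creation time, roughly like the sketch below (the -h flag for hdfs-site overrides and the instance settings are from memory, so treat this as approximate rather than a verified incantation):

# -h passes an hdfs-site.xml override to the configure-hadoop bootstrap action
# instance type/count below are just illustrative
elastic-mapreduce --create --alive \
  --instance-type c1.medium --num-instances 3 \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-h,dfs.replication=1"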
Two questions: 1) Any insight into why S3DistCp is failing? 2) This one is somewhat unrelated: is it possible to create a jobflow in which the very first step is an S3DistCp job, so the cluster starts out initialized with data?
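For question 2, something along these lines is what I have in mind, assuming --create accepts a custom-jar step the same way adding a step to a running jobflow does (same bucket and paths as above, and the jobflow name is just a placeholder):

elastic-mapreduce --create --alive --name "init-hdfs-from-s3" \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.0.1/s3distcp.jar \
  --arg --src --arg 's3n://my-bucket/src' \
  --arg --dest --arg 'hdfs:///my-emr-hdfs/dest/'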
Any insight appreciated. Thanks.
Update: My test in the comments below didn't seem to work. Here's some more info from the logs:
WARN org.apache.hadoop.hdfs.DFSClient (Thread-15): Error Recovery for block null bad datanode[0] nodes == null
WARN org.apache.hadoop.hdfs.DFSClient (Thread-15): Could not get block locations. Source file "/mnt/tmp/mapred/system/jobtracker.info" - Aborting...
WARN org.apache.hadoop.mapred.JobTracker (main): Writing to file hdfs://xx.xx.xx.xx:xxxx/mnt/tmp/mapred/system/jobtracker.info failed!
WARN org.apache.hadoop.mapred.JobTracker (main): FileSystem is not ready yet!
WARN org.apache.hadoop.mapred.JobTracker (main): Failed to initialize recovery manager.