I'm using Amazon EMR and I'm able to run most jobs fine. I run into a problem when I start loading and generating more data within the EMR cluster: the cluster runs out of storage space.

Each data node is a c1.medium instance. According to the links here and here, each data node should come with 350GB of instance storage. Through the ElasticMapReduce Slave security group, I've been able to verify in my AWS Console that the c1.medium data nodes are running and are instance-store backed.

When I run hadoop dfsadmin -report on the namenode, each data node shows only about 10GB of storage. This is further verified by running df -h on a data node:

hadoop@domU-xx-xx-xx-xx-xx:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             9.9G  2.6G  6.8G  28% /
tmpfs                 859M     0  859M   0% /lib/init/rw
udev                   10M   52K   10M   1% /dev
tmpfs                 859M  4.0K  859M   1% /dev/shm
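
The same shortfall shows up per node in the dfsadmin report. Assuming the stock Hadoop 0.20 report format, the capacity lines can be pulled out directly:

hadoop@domU-xx-xx-xx-xx-xx:~$ hadoop dfsadmin -report | grep "Configured Capacity"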

How can I configure my data nodes to launch with the full 350GB of storage? Is there a way to do this using a bootstrap action?
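
For reference, if a bootstrap script turns out to be the answer, I'd expect it to be passed to the same CLI tool like this (the S3 path below is just a placeholder):

./elastic-mapreduce --create --name "Job" --bootstrap-action "s3://mybucket/mount-instance-storage.sh"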

1 Answer

After more research and a post on the AWS forum, I got a solution, although not a full understanding of what happened under the hood. Thought I would post this as an answer if that's okay.

Turns out there is a bug in AMI version 2.0, which of course was the version I was trying to use. (I had switched to 2.0 because I wanted Hadoop 0.20 to be the default.) The bug prevents instance storage from being mounted on 32-bit instances, which is what c1.mediums launch as.

By specifying "latest" as the AMI version in the CLI tool, the problem was fixed and each c1.medium launched with the full 350GB of storage.

For example:

./elastic-mapreduce --create --name "Job" --ami-version "latest" --other-options
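
Once the cluster is up, the same df -h and hadoop dfsadmin -report checks from the question are an easy way to confirm that the instance storage actually mounted.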

More information about using AMI versions and "latest" can be found here. Currently "latest" points to AMI 2.0.4. AMI 2.0.5 is the most recent release, but it looks like it is also still a little buggy.