
I want to implement the paper "Location-Aware MapReduce in Virtual Cloud", which I got from IEEE. Here is a summary: 8 physical machines, each hosting 4 virtual machines, with Hadoop HDFS installed on each VM. Suppose we have a cluster of p physical machines, each with one hard disk, and the replication factor is 3. Then n file blocks are put into the cluster from a computer outside the cluster, or generated randomly inside the cluster. The model is about data pattern generation and task pattern generation given a certain data pattern. Each block has the same probability of being placed on physical machines that host the same number of virtual machines. Because Hadoop's data allocation strategy is random, a data pattern may occur in which all replicas of a file block stack up on one physical machine: http://imageshack.us/photo/my-images/42/allstack.png/

The proposed strategies are round-robin allocation and serpentine allocation, which in theory look like this: http://imageshack.us/photo/my-images/43/proposed.png/

How can I make Hadoop aware that a certain number of virtual machines are on one physical machine?

How can I make Hadoop not replicate the 2nd and 3rd replicas of a file block onto virtual machines that are on the same physical machine? I have asked how to implement this, and the reply was to use the rack awareness configuration, but I am still confused and need more references on that.

How can I trace those file blocks and confirm that the replicas are distributed evenly across the physical machines, with no file block's replicas all stacked on one physical machine? If I configure rack awareness as suggested, is it guaranteed that file block replicas will be distributed evenly across physical machines?


1 Answer


Assumption: we know which virtual machine is created on which physical machine.

This assumption does not hold in public cloud environments, where you cannot see or control VM placement, so the solution described below cannot work there. It will work in private clouds.

Implementing rack awareness involves two steps:

  1. Set the topology script file name in core-site.xml (a note on the property name in newer Hadoop versions follows this list)

     <property>
          <name>topology.script.file.name</name>
          <value>core/rack-awareness.sh</value>
     </property>
    
  2. Implement the script

    A sample rack-awareness.sh can look like the one below (a quick manual test of it follows this list):

    #!/bin/bash
    # Resolve each host name/IP that Hadoop passes as an argument
    # to a rack, using the lookup table in ${HADOOP_CONF}/cluster.data.
    HADOOP_CONF=/etc/hadoop/conf
    while [ $# -gt 0 ] ; do
      nodeArg=$1
      exec < ${HADOOP_CONF}/cluster.data
      result=""
      while read line ; do
         ar=( $line )
         if [ "${ar[0]}" = "$nodeArg" ] ; then
           result="${ar[1]}"
         fi
      done
      shift
      if [ -z "$result" ] ; then
         # Host not found in the table: report a default rack
         echo -n "/default/rack "
      else
         echo -n "$result "
      fi
    done
    

    And the contents of cluster.data can be

    hadoopdata1.ec.com     /dc1/rack1
    hadoopdata1            /dc1/rack1
    10.1.1.1               /dc1/rack2
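
You can sanity-check the script by hand before wiring it into Hadoop. Assuming cluster.data is already in /etc/hadoop/conf with the contents above, a run could look like this (the unknown host falls back to the default rack; the hostnames are just the sample entries):

    $ bash core/rack-awareness.sh 10.1.1.1 hadoopdata1 some-unknown-host
    /dc1/rack2 /dc1/rack1 /default/rack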
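
One caveat on step 1: topology.script.file.name is the Hadoop 1.x property name. If your cluster runs Hadoop 2.x or later, the same setting is named net.topology.script.file.name (the old name is still accepted as a deprecated alias), so the core-site.xml entry would be:

     <property>
          <name>net.topology.script.file.name</name>
          <value>core/rack-awareness.sh</value>
     </property>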
    

As you can see, Hadoop depends entirely on the rack values we provide. You can use this fact to distribute data blocks across virtual machines that reside on different physical machines: treat each physical machine as its own "rack". Note that with Hadoop's default placement policy, the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that second rack. So mapping physical machines to racks guarantees the three replicas never all stack on one physical machine, although two of them may still share one.

For example:

Virtual Machine 1 (VM1) 10.83.51.2 is on Physical Machine 1 (PM1)
Virtual Machine 2 (VM2) 10.83.51.3 is on Physical Machine 1 (PM1)
Virtual Machine 3 (VM3) 10.83.51.4 is on Physical Machine 2 (PM2)

You can have cluster.data as

10.83.51.2 /dc1/rack1
10.83.51.3 /dc1/rack1
10.83.51.4 /dc1/rack2
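
To verify the placement afterwards (the tracing question above), you can ask HDFS directly where each block's replicas live. A minimal check, using the Hadoop 1.x-style commands that match the topology.script.file.name property used here (the file path is just a placeholder):

    # Print the topology the NameNode has resolved: each rack line
    # should list only the VMs of one physical machine.
    hadoop dfsadmin -printTopology

    # List every block of a file together with the racks holding its
    # replicas; no block should show all replicas under the same rack.
    hadoop fsck /user/hduser/testfile -files -blocks -racks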