
I want to implement the paper "Location-Aware MapReduce in Virtual Cloud", which I got from IEEE. Here is a summary: 8 physical machines, each hosting 4 virtual machines, with Hadoop HDFS installed on each VM. Suppose we have a cluster of p physical machines, each with one hard disk, and the replication factor is 3. Then n file blocks are put into the cluster from a computer outside the cluster, or generated randomly inside the cluster. The model is about data pattern generation and task pattern generation given a certain data pattern. Each block has the same probability of being placed on physical machines that host the same number of virtual machines. Because Hadoop's data allocation strategy is random, a data pattern may occur in which all replicas of a file block stack up on one physical machine: http://imageshack.us/photo/my-images/42/allstack.png/

The proposed strategies are round-robin allocation and serpentine allocation, which in theory look like this: http://imageshack.us/photo/my-images/43/proposed.png/

How can I make Hadoop aware that a certain number of virtual machines are on one physical machine?

How can I make Hadoop not replicate the 2nd and 3rd replicas of a file block onto virtual machines that are on the same physical machine? I have asked how to implement this, and the reply was to use the rack awareness configuration, but I am still confused and need more references on that.

How can I trace those file blocks and confirm that the replicas are distributed evenly across the physical machines, with no file block's replicas all stacked on one physical machine? If I configure rack awareness as suggested, is it guaranteed that file block replicas will be distributed evenly across physical machines?


1 Answer


Assumption: we know which virtual machine is created on which physical machine.

This assumption does not hold in public cloud environments, where you cannot see or control VM placement, so the solution described below cannot work there. It will work in private clouds.

Implementing rack awareness involves two steps:

  1. Set the topology script file name in core-site.xml (a note on the property name in newer Hadoop versions follows this list)

     <property>
          <name>topology.script.file.name</name>
          <value>core/rack-awareness.sh</value>
     </property>
    
  2. Implement the script

    A sample rack-awareness.sh can look like the one below (a quick manual test of it follows this list):

    #!/bin/bash
    # Resolve each host name/IP that Hadoop passes as an argument
    # to a rack, using the lookup table in ${HADOOP_CONF}/cluster.data.
    HADOOP_CONF=/etc/hadoop/conf
    while [ $# -gt 0 ] ; do
      nodeArg=$1
      exec < ${HADOOP_CONF}/cluster.data
      result=""
      while read line ; do
         ar=( $line )
         if [ "${ar[0]}" = "$nodeArg" ] ; then
           result="${ar[1]}"
         fi
      done
      shift
      if [ -z "$result" ] ; then
         # Host not found in the table: report a default rack
         echo -n "/default/rack "
      else
         echo -n "$result "
      fi
    done
    

    And the contents of cluster.data can be

    hadoopdata1.ec.com     /dc1/rack1
    hadoopdata1            /dc1/rack1
    10.1.1.1               /dc1/rack2
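
You can sanity-check the script by hand before wiring it into Hadoop. Assuming cluster.data is already in /etc/hadoop/conf with the contents above, a run could look like this (the unknown host falls back to the default rack; the hostnames are just the sample entries):

    $ bash core/rack-awareness.sh 10.1.1.1 hadoopdata1 some-unknown-host
    /dc1/rack2 /dc1/rack1 /default/rack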
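
One caveat on step 1: topology.script.file.name is the Hadoop 1.x property name. If your cluster runs Hadoop 2.x or later, the same setting is named net.topology.script.file.name (the old name is still accepted as a deprecated alias), so the core-site.xml entry would be:

     <property>
          <name>net.topology.script.file.name</name>
          <value>core/rack-awareness.sh</value>
     </property>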
    

As you can see, Hadoop depends entirely on the rack values we provide. You can use this fact to distribute data blocks across virtual machines that reside on different physical machines: treat each physical machine as its own "rack". Note that with Hadoop's default placement policy, the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that second rack. So mapping physical machines to racks guarantees the three replicas never all stack on one physical machine, although two of them may still share one.

For example:

Virtual Machine 1 (VM1) 10.83.51.2 is on Physical Machine 1 (PM1)
Virtual Machine 2 (VM2) 10.83.51.3 is on Physical Machine 1 (PM1)
Virtual Machine 3 (VM3) 10.83.51.4 is on Physical Machine 2 (PM2)

You can have cluster.data as

10.83.51.2 /dc1/rack1
10.83.51.3 /dc1/rack1
10.83.51.4 /dc1/rack2
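
To verify the placement afterwards (the tracing question above), you can ask HDFS directly where each block's replicas live. A minimal check, using the Hadoop 1.x-style commands that match the topology.script.file.name property used here (the file path is just a placeholder):

    # Print the topology the NameNode has resolved: each rack line
    # should list only the VMs of one physical machine.
    hadoop dfsadmin -printTopology

    # List every block of a file together with the racks holding its
    # replicas; no block should show all replicas under the same rack.
    hadoop fsck /user/hduser/testfile -files -blocks -racks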