Here is my goal: create on-demand Hadoop clusters (with the number of nodes specified on the fly) using EMR 5.3.0 or EMR 5.4.0 with Spark 2.1.0 through the AWS CLI, while storing the input and output data in S3, without having to manage a 24/7 cluster or HDFS for data storage.

Here are my challenges / questions: a) Can I do the above using the 'aws emr create-cluster' command and specify the number of nodes? For example, if I specify the parameter

--instance-count 10    

will this create one master node and 9 core nodes?
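For reference, this is roughly the full command I have in mind (the cluster name, key pair, log bucket, and instance type are just placeholders):

    aws emr create-cluster \
      --name "on-demand-spark" \
      --release-label emr-5.4.0 \
      --applications Name=Spark \
      --instance-type m4.xlarge \
      --instance-count 10 \
      --ec2-attributes KeyName=my-key-pair \
      --log-uri s3://my-bucket/emr-logs/ \
      --use-default-roles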

b) If I use 'aws emr create-cluster', can I add more worker nodes (I guess they're called task nodes) on the fly to speed up a job, using the CLI?

c) If I install Anaconda and other software on the cluster (i.e. on the master) and then save the master and all slave nodes as AMIs, can I still launch an on-demand Hadoop cluster from these AMIs with a different number of nodes, specified on the fly with the AWS CLI?

Thank you. Appreciate your feedback.

2 Answers


Using auto scaling on AWS EMR, you can scale nodes out and in on a cluster. Scale-out actions can be triggered using CloudWatch metrics (e.g. YARNMemoryAvailablePercentage and ContainerPendingRatio). A sample policy is below:

"AutoScalingPolicy":
{
 "Constraints":
  {
   "MinCapacity": 10,
   "MaxCapacity": 50
  },

 "Rules":
 [
  {"Name": "Compute-scale-up",
   "Description": "Scale out based on ContainerPending Mterics",
   "Action":
    {
     "SimpleScalingPolicyConfiguration":
      {"AdjustmentType": "CHANGE_IN_CAPACITY",
       "ScalingAdjustment": 1,
       "CoolDown":0}
  },
   "Trigger":
    {"CloudWatchAlarmDefinition":
      {"AlarmNamePrefix": "compute-scale-up",
       "ComparisonOperator": "GREATER_THAN_OR_EQUAL",
       "EvaluationPeriods": 3,
       "MetricName": "ContainerPending",
       "Namespace": "AWS/ElasticMapReduce",
       "Period": 300,
       "Statistic": "AVERAGE",
       "Threshold": 10,
       "Unit": "COUNT",
       "Dimensions":
        [
          {"Key": "JobFlowId",
           "Value": "${emr:cluster_id}"}
        ]
      }
    }
  },
  {"Name": "Compute-scale-down",
   "Description": "Scale in",
   "Action":
    {
      "SimpleScalingPolicyConfiguration":
      {"AdjustmentType": "CHANGE_IN_CAPACITY",
       "ScalingAdjustment": -1,
       "CoolDown":300}
    },
   "Trigger":
    {"CloudWatchAlarmDefinition":
      {"AlarmNamePrefix": "compute-scale-down",
       "ComparisonOperator": "GREATER_THAN_OR_EQUAL",
       "EvaluationPeriods": 3,
       "MetricName": "MemoryAvailableMB",
       "Namespace": "AWS/ElasticMapReduce",
       "Period": 300,
       "Statistic": "AVERAGE",
       "Threshold": 24000,
       "Unit": "COUNT",
       "Dimensions":
        [
          {"Key": "JobFlowId",
           "Value": "${emr:cluster_id}"}
        ]
      }
    }
  }
 ]

}
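To attach a policy like this to an existing instance group from the CLI, save the inner policy object (the Constraints and Rules) to a file and use 'aws emr put-auto-scaling-policy'. A minimal sketch, assuming the cluster was created with an auto-scaling role (EMR_AutoScaling_DefaultRole by default); the cluster and instance group IDs below are placeholders:

    aws emr put-auto-scaling-policy \
      --cluster-id j-XXXXXXXXXXXXX \
      --instance-group-id ig-XXXXXXXXXXXXX \
      --auto-scaling-policy file://autoscaling-policy.json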

You can refer to this blog for more details: https://aws.amazon.com/blogs/big-data/dynamically-scale-applications-on-amazon-emr-with-auto-scaling/


a) Can I do the above using the 'aws emr create-cluster' command and specify the number of nodes? For example, if I specify the parameter...

Yes.

If I use 'aws emr create-cluster', can I add more worker nodes (I guess they're called task nodes) on the fly to speed up a job, using the CLI?

Yes, you can add task nodes to a running cluster from the CLI; see the sketch below. Since your goal is to add instances on the fly, I would also suggest you look at reserved or spot instances (depending on your use case and cost).
We use spot instances with a bid of 50% of the on-demand price, and terminate them after the job completes.
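For example, to add a group of spot task nodes to a running cluster (the cluster ID, instance type, count, and bid price below are placeholders; drop BidPrice for on-demand instances):

    aws emr add-instance-groups \
      --cluster-id j-XXXXXXXXXXXXX \
      --instance-groups InstanceGroupType=TASK,InstanceType=m4.xlarge,InstanceCount=4,BidPrice=0.10

An existing group can later be resized with 'aws emr modify-instance-groups'.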

If I install Anaconda and other software on the cluster (i.e. on the master) and then save the master and all slave nodes as AMIs, can I still launch an on-demand Hadoop cluster from these AMIs with a different number of nodes, specified on the fly with the AWS CLI?

Yes, you can.
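One caveat: EMR supports custom AMIs only from release 5.7.0 onward, and it applies a single custom AMI to all nodes, so on EMR 5.3.0/5.4.0 you would install the software with bootstrap actions instead. Assuming a release that supports it, a sketch of launching from a custom AMI (the cluster name and AMI ID are placeholders):

    aws emr create-cluster \
      --name "cluster-from-custom-ami" \
      --release-label emr-5.7.0 \
      --custom-ami-id ami-0123456789abcdef0 \
      --applications Name=Spark \
      --instance-type m4.xlarge \
      --instance-count 20 \
      --use-default-roles

The --instance-count is independent of the AMI, so you can pick a different number of nodes on each launch.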