Here is my goal: create on-demand Hadoop clusters (with the number of nodes specified on the fly) using EMR 5.3.0 or EMR 5.4.0 with Spark 2.1.0 through the AWS CLI, storing the input and output data in S3, so that I don't have to manage a 24/7 cluster or use HDFS for data storage.
Here are my challenges / questions:

a) Can I do the above with the `aws emr create-cluster` command and specify the number of nodes? For example, if I pass

--instance-count 10

will that create one master node and 9 core nodes?
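For context, here is a sketch of the kind of invocation I mean. The cluster name, key pair, instance type, and S3 bucket are placeholders; the flags themselves are from the AWS CLI `emr create-cluster` command.

```shell
# Launch an on-demand EMR 5.4.0 cluster with Spark.
# --instance-count 10 should yield 1 master + 9 core nodes.
# KeyName and the S3 bucket below are placeholders.
aws emr create-cluster \
  --name "on-demand-spark-cluster" \
  --release-label emr-5.4.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 10 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-bucket/emr-logs/
```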
b) If I launch a cluster with `aws emr create-cluster`, can I then add more worker nodes (I believe these are called task nodes) on the fly via the CLI to speed up a job?
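My understanding (please correct me if wrong) is that this would be done with `aws emr add-instance-groups`, roughly like the sketch below; the cluster ID is a placeholder.

```shell
# Add a task instance group of 4 nodes to a running cluster.
# j-XXXXXXXXXXXXX is a placeholder cluster ID
# (obtainable from 'aws emr list-clusters --active').
aws emr add-instance-groups \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups InstanceGroupType=TASK,InstanceType=m4.large,InstanceCount=4
```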
c) If I install Anaconda and other software on the cluster (i.e., on the master) and then save the master and all slave nodes as AMIs, can I still launch an on-demand Hadoop cluster from those AMIs, with a different number of nodes specified on the fly via the AWS CLI?
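If custom AMIs are usable here at all, I assume it would look something like the sketch below via the `--custom-ami-id` flag; note that, as far as I can tell, EMR only added custom AMI support in release 5.7.0, so this may not apply to 5.3.0/5.4.0 (a bootstrap action installing Anaconda at launch may be the alternative). The AMI ID is a placeholder.

```shell
# Launch from a custom AMI (EMR 5.7.0+ only, as I understand it).
# ami-0123456789abcdef0 is a placeholder for an AMI with
# Anaconda etc. pre-installed; node count is still set per launch.
aws emr create-cluster \
  --name "custom-ami-spark-cluster" \
  --release-label emr-5.7.0 \
  --custom-ami-id ami-0123456789abcdef0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 6 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair
```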
Thank you. Appreciate your feedback.