Apache Hadoop is designed to run across a cluster of commodity machines (nodes); it was not originally designed for cloud deployment. But because the cloud can simulate individual nodes through VMs, cloud-based Hadoop clusters emerged. That presents an understanding difficulty for me. Every standard explanation of a Hadoop cluster assumes on-prem architecture, because the whole Hadoop architecture is explained with a simple, logical on-prem view in mind. This makes it hard to understand how a cloud-based cluster works, especially concepts such as HDFS and data locality. In the on-prem explanation, every node has its own 'local' storage (which also implies the storage hardware is fixed to a specific node and never gets shuffled around), and it is never assumed that a node gets deleted. We also treat that storage as part of the node itself, so we never think of killing a node and retaining its storage for later use.
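For concreteness, here is a minimal sketch of what I mean by data locality, using the standard Hadoop Java client API (the file path and cluster are hypothetical): each block of an HDFS file reports which worker nodes hold a physical replica, and the scheduler uses those hostnames to place tasks next to the data.

```java
// Minimal sketch: how classic on-prem HDFS exposes block locality.
// The path "/data/input.txt" is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // hdfs:// on an on-prem cluster
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));

        // Each block reports the worker nodes holding a physical replica;
        // the scheduler tries to run a task on one of those hosts.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```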
Now, in the cloud-based Hadoop (HDInsight) model, we can attach any Azure Storage account as the primary storage for the cluster. So let's say we have a cluster with 4 worker nodes and 2 head nodes: that single Azure Storage account acts as the HDFS space for 6 virtual machines? And again, the actual business data is not even stored on that account; it's stored on additional attached storage accounts. So I am not able to understand how this translates to the on-prem Hadoop cluster. The core design of a Hadoop cluster revolves around the concept of data locality: data resides closest to the processing. I know that when we create an HDInsight cluster we create it in the same region as the storage accounts being attached, but it's more like multiple processing units (VMs) all sharing common storage rather than individual nodes with their own local storage. Probably, as long as a node can access the data fast enough (as though it resided locally) within the data center, it shouldn't matter, but I'm not sure if that's the case. The cloud-based model presents the following picture to me: a set of interchangeable compute VMs pointed at a shared, remote storage pool.
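To illustrate what I think changes, here is a minimal sketch, assuming a WASB-backed HDInsight cluster with the hadoop-azure connector on the classpath (the storage account, container, and file names are hypothetical): the identical FileSystem API now resolves to a remote blob-storage endpoint rather than node-local disks, which seems to be exactly what makes the cluster disposable.

```java
// Minimal sketch: the same FileSystem API on HDInsight, but the default
// filesystem is an Azure Storage account (wasbs://) instead of local HDFS.
// Account, container, and file names below are hypothetical.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On HDInsight, fs.defaultFS is preconfigured to a URI like this,
        // so "HDFS-style" paths transparently hit the storage account.
        URI root = new URI("wasbs://mycontainer@myaccount.blob.core.windows.net/");
        FileSystem fs = FileSystem.get(root, conf);

        // Same API as on-prem HDFS, but no worker VM actually holds these
        // bytes on a local disk, so "locality" loses its physical meaning.
        System.out.println(fs.getFileStatus(new Path("/example/data/sample.log")).getLen());
        fs.close();
    }
}
```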
Can someone explain exactly how the Apache Hadoop design gets translated into the Azure-based model? The confusion arises from the fact that the storage accounts are fixed, while we can kill/spin up a cluster at any time and point it to the same storage accounts.