
I am learning from this course. It asks me to create a new HDInsight cluster (the options are Hadoop, HBase, Storm, or Spark) and also a storage account. What is the difference between a cluster and a storage account? Does the cluster provide the processors that run my jobs, and is the storage account the space that holds my data? Why can't I connect the same storage account to different clusters?

Also, under Microsoft Azure >> New >> Data + Analytics, I see two options that deal with big data: HDInsight and Data Lake Analytics. What is the difference between the two? They look similar:

HDInsight: Microsoft's cloud-based Big Data service. Apache Hadoop and other popular Big Data solutions.

Data Lake Analytics: Big data analytics made easy


1 Answer


There are a lot of questions in here, so let me answer them one by one.

What is Blob Storage vs HDInsight Cluster? Blob storage is a distributed file store very similar to HDFS and is used to store data, videos, and other files. An HDInsight cluster is a number of Hadoop virtual machines created to run MapReduce code over a DFS (HDFS or Blob storage). Having two separate services allows you to scale each independently, which saves money in the long term. Data storage is cheap, but a 500-node VM cluster can get pricey quickly, so being able to kill the cluster but keep your data is helpful.
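To make the cost argument concrete, here is a rough sketch of the arithmetic. The function name and every price in it are hypothetical placeholders I made up for illustration, not real Azure rates; plug in the numbers from the pricing pages for your region.

```python
def monthly_cost(storage_tb, nodes, cluster_hours,
                 storage_price_per_tb_month, node_price_per_hour):
    """Return (storage_cost, compute_cost) for one month.

    Storage is billed for the data kept all month; compute is billed
    only for the hours the cluster actually exists.
    """
    storage = storage_tb * storage_price_per_tb_month
    compute = nodes * cluster_hours * node_price_per_hour
    return storage, compute


# All prices below are made-up placeholders, NOT real Azure prices.
STORAGE_PRICE = 20.0   # hypothetical cost per TB per month
NODE_PRICE = 0.5       # hypothetical cost per node per hour

# Cluster left running all month (720 hours) vs. created only for jobs (40 hours).
always_on = monthly_cost(10, 500, 720, STORAGE_PRICE, NODE_PRICE)
on_demand = monthly_cost(10, 500, 40, STORAGE_PRICE, NODE_PRICE)

print("always on:", always_on)
print("on demand:", on_demand)
```

In both cases the storage part of the bill is identical; only the compute part shrinks when you delete the cluster between jobs, which is the point of keeping the two services separate.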

Why can't I connect the same storage account with different clusters? You can have multiple clusters pointed at the same storage account, but it's an anti-pattern. Storage accounts have data and IO limits, and if multiple clusters pull against a single storage account, you are more likely to hit them. Also, storage accounts only cost money when they hold data, so having several of them doesn't increase your cost.
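A minimal sketch of why sharing one account makes the limits easier to hit: the demand from every attached cluster adds up against a single account limit. The helper name, the demand figures, and the limit are all hypothetical values chosen for illustration; the real limits are listed on the storage limits page in the references.

```python
def clusters_within_limit(per_cluster_demands, account_limit):
    """Check whether the combined demand of all clusters attached to
    one storage account stays within that account's limit."""
    total = sum(per_cluster_demands)
    return total <= account_limit, total


# Hypothetical numbers (e.g. requests per second), not real Azure limits.
ACCOUNT_LIMIT = 20000
demands = [8000, 9000, 7000]  # three clusters

# All three clusters sharing one storage account.
shared_ok, shared_total = clusters_within_limit(demands, ACCOUNT_LIMIT)

# Each cluster with its own storage account.
separate_ok = all(clusters_within_limit([d], ACCOUNT_LIMIT)[0] for d in demands)

print("shared account ok:", shared_ok, "total demand:", shared_total)
print("separate accounts ok:", separate_ok)
```

With these placeholder numbers, each cluster fits comfortably within its own account, but the three together exceed the limit of a single shared account.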

What is Azure Data Lake (ADL) and ADL storage? Azure Data Lake is another option for both storage and compute. ADL storage can be thought of as Blob storage v2: you get higher limits on IO and file size than with Blob storage, while still being able to use Hadoop for compute. ADL Analytics (the "Data Lake Analytics" option you saw) is a second option for compute that is completely different from Hadoop. You don't have to worry about cluster creation, or about clusters in general. You write a query, specify the amount of parallelization you'd like, and the data is returned.

References:

https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storage-limits

https://azure.microsoft.com/en-us/services/hdinsight/

https://azure.microsoft.com/en-us/solutions/data-lake/