5
votes

When creating an HDInsight Hadoop cluster in Azure, there are two storage options: Azure Data Lake Store (ADLS) or Azure Blob Storage.

What are the real differences between these two options, and how do they affect performance?

I found this page: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-comparison-with-blob-storage However, it is not very specific and uses only general statements such as "ADLS is optimized for analytics".

Does that mean it's better for storing the HDInsight file system? And if ADLS is indeed faster, why not use it for non-analytics data as well?

3 Answers

4
votes

As per this document, an Azure Storage account can hold up to 4.75 TB, though individual blobs (or files from an HDInsight perspective) can only go up to 195 GB. Azure Data Lake Store can grow dynamically to hold trillions of files, with individual files greater than a petabyte. For more information, see Understanding blobs and Data Lake Store.

Also, check Benefits of Azure Storage and Use Data Lake Store for more details and comparisons.

Hope this helps.

2
votes

In addition to Ashok's answer: ADLS is currently available in far fewer regions than Azure Storage. So if you need your HDInsight cluster in a specific region, you should make sure your storage is available in that same region.

Another benefit of ADLS over Azure Storage is its POSIX-based security model at the file/folder level that uses AAD security principals instead of Shared Access Keys.
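To make the file/folder-level model a bit more concrete, here is a minimal sketch of how a POSIX-style owner/group/other check works in general. It is not the ADLS API: the `Entry` class, the `is_allowed` function, and the principal IDs are all hypothetical, and the sketch deliberately ignores any richer ACL features the service may offer.

```python
from dataclasses import dataclass

# POSIX permission bits: read = 4, write = 2, execute = 1.
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass
class Entry:
    """A file or folder with an owner, an owning group, and rwx bits per class."""
    owner: str       # hypothetical AAD object ID of the owning user
    group: str       # hypothetical AAD object ID of the owning group
    owner_bits: int  # permissions for the owner
    group_bits: int  # permissions for members of the owning group
    other_bits: int  # permissions for everyone else


def is_allowed(entry: Entry, principal: str, groups: set, wanted: int) -> bool:
    """Return True if `principal` holds every bit in `wanted` on `entry`."""
    if principal == entry.owner:
        bits = entry.owner_bits
    elif entry.group in groups:
        bits = entry.group_bits
    else:
        bits = entry.other_bits
    return (bits & wanted) == wanted


# A folder with rwx for the owner, r-x for the group, nothing for others (750).
folder = Entry(owner="user-a", group="group-analysts",
               owner_bits=7, group_bits=5, other_bits=0)

print(is_allowed(folder, "user-a", set(), READ | WRITE))       # owner
print(is_allowed(folder, "user-b", {"group-analysts"}, READ))  # group member
print(is_allowed(folder, "user-b", {"group-analysts"}, WRITE)) # group lacks write
```

The point of the contrast is that access is decided per principal and per file or folder, whereas a shared key grants the same access to anyone who holds it.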

The reason why you may not want to use ADLS for non-analytics data is primarily cost. Because of some of the additional capabilities, it is currently a bit more expensive.

0
votes

In addition to the other answers, it's not possible to use the Spark Data Factory activity on HDInsight clusters that use Data Lake as the primary storage. This limitation applies to both ADFv1 and v2, as seen here: https://docs.microsoft.com/en-us/azure/data-factory/v1/data-factory-spark and https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-spark