I have just started working on a data analysis project that requires processing high-volume data with Azure Databricks. While planning to use a Databricks notebook for the analysis, I came across several storage options for loading the data: (a) DBFS, the default file system from Databricks, (b) Azure Data Lake Storage (ADLS), and (c) Azure Blob Storage. It looks like (b) and (c) can be mounted into the workspace to retrieve the data for our analysis.
With that understanding, could I get the following questions clarified?
- What is the difference between these storage options in the context of Databricks? Do DBFS and ADLS follow HDFS's file management principles under the hood, such as breaking files into blocks and using a NameNode and DataNodes?
- If I mount an Azure Blob Storage container to analyze the data, would I still get the same performance as with the other storage options? Given that Blob Storage is an object store, are files still broken into blocks, with those chunks loaded as RDD partitions on the Spark executor nodes?
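
To make the last question concrete, here is a minimal sketch in plain Python of what "breaking a file into chunks that become partitions" would mean. It is a conceptual illustration only, not Spark's actual split planning, and it makes no claim about how a mounted Blob Storage container is really read. The 128 MiB split size is an assumption taken from the documented default of `spark.sql.files.maxPartitionBytes`; the names `ByteRange` and `plan_splits` and the path `/mnt/mydata/events.csv` are hypothetical.

```python
# Conceptual sketch only: NOT Spark's actual implementation.
from dataclasses import dataclass
from typing import List

# Assumption: 128 MiB, the documented default of spark.sql.files.maxPartitionBytes.
MAX_PARTITION_BYTES = 128 * 1024 * 1024


@dataclass(frozen=True)
class ByteRange:
    """A contiguous slice of one file that a single task would read."""
    path: str
    start: int   # inclusive byte offset into the file
    length: int  # number of bytes in this slice


def plan_splits(path: str, file_size: int,
                split_bytes: int = MAX_PARTITION_BYTES) -> List[ByteRange]:
    """Cut a file of `file_size` bytes into ranges of at most `split_bytes`."""
    if split_bytes <= 0:
        raise ValueError("split_bytes must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        # The final range is shorter when the size is not an exact multiple.
        length = min(split_bytes, file_size - offset)
        splits.append(ByteRange(path, offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    # Hypothetical 300 MiB file under a hypothetical mount point.
    for s in plan_splits("/mnt/mydata/events.csv", 300 * 1024 * 1024):
        print(s)
```

In a Databricks notebook, the partition count that Spark actually produces for a given path could be compared against this kind of estimate by reading the file into a DataFrame `df` and calling `df.rdd.getNumPartitions()`.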