0
votes

I am new to Azure Databricks. I was trying to read data from Azure Data Lake into Databricks, and I found that there are mainly two methods:

  1. Mounting the Data Lake folder into DBFS (the advantage being that authentication is required only once)
  2. Using a service principal and OAuth directly (authentication is required for each request)
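For reference, option 1 is typically done with `dbutils.fs.mount` and a set of OAuth Spark configs. The sketch below is a hedged example, not a definitive recipe: the container, storage account, and credential values are all placeholders, and `dbutils` only exists inside a Databricks notebook or job.

```python
# Sketch: mounting an ADLS Gen2 container into DBFS using a service principal.
# All identifiers (client id, tenant id, names) are hypothetical placeholders.

def build_oauth_configs(client_id, client_secret, tenant_id):
    """Spark configs for service-principal OAuth against Azure AD."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

def mount_datalake(dbutils, container, storage_account, mount_point, configs):
    # dbutils is only available inside a Databricks cluster session.
    dbutils.fs.mount(
        source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
        mount_point=mount_point,  # e.g. "/mnt/datalake"
        extra_configs=configs,
    )
```

Once mounted, any cluster in the workspace can read `/mnt/datalake/...` without re-authenticating, which is the "authenticate once" advantage mentioned above.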

I am interested to know whether there is any significant memory consumption when we choose to mount folders in DBFS. I learnt that the mounted data is persisted, so I am guessing that might lead to some memory consumption. I would appreciate it if somebody could explain what happens on the backend when we mount a folder in DBFS.


1 Answer

0
votes

The question of persistent data:

As far as I understand from the DBFS documentation, data read through a mount point is not persisted in the DBFS root:

"Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writeable, we recommend that you store data in mounted object storage rather than in the DBFS root."

Instead, you can write data directly to DBFS (which is, under the hood, just an Azure Storage account), and that data will persist between restarts of your cluster. For example, you could store a small example dataset directly in DBFS.
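To make the distinction concrete, here is a minimal sketch. The paths are hypothetical, `spark` only exists inside a Databricks session, and the small helper simply encodes the rule quoted from the docs (paths under `/mnt` live outside the DBFS root):

```python
def stored_outside_dbfs_root(path: str) -> bool:
    """Per the quoted docs, /mnt paths are stored outside the DBFS root,
    i.e. in your own mounted object storage rather than the managed account."""
    return path.startswith("/mnt/") or path.startswith("dbfs:/mnt/")

def write_example_dataset(spark):
    """Write a tiny DataFrame to the DBFS root; it survives cluster restarts.
    Only runnable inside a Databricks session where `spark` is defined."""
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    # DBFS root path (managed storage account, persists across restarts):
    df.write.mode("overwrite").parquet("dbfs:/example/dataset")
```

So `dbfs:/example/dataset` lands in the DBFS root, while `dbfs:/mnt/datalake/...` would land in your own Data Lake.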

Best practice with Data Lake Gen 1

As there shouldn't be any performance implications either way, I don't think there is a single overall "best practice." Based on my experience, it is good to keep in mind that both solutions can be confusing to new users who don't know how authentication is being done.