4 votes

I need some clarity on Databricks DBFS.

In simple basic terms, what is it, what is the purpose of it and what does it allow me to do?

The Databricks documentation says, to this effect:

"Files in DBFS persist to Azure Blob storage, so you won’t lose data even after you terminate a cluster."

Any insight would be helpful; I haven't been able to find documentation that goes into the details of it from an architecture and usage perspective.


3 Answers

5 votes

I have experience with DBFS: it is a great storage layer that holds data you can upload from your local computer using the DBFS CLI. The CLI setup is a bit tricky, but once you get it working you can easily move whole folders around in this environment (remember to use --overwrite!). With it you can (a notebook-side sketch of the same operations follows the list):

  1. create folders
  2. upload files
  3. modify, remove files and folders
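Once files are in DBFS, the same housekeeping can also be done from a notebook with dbutils.fs. Below is a minimal sketch (the paths are placeholders), with the rough CLI equivalents in the comments:

// Create a folder -- CLI equivalent: databricks fs mkdirs dbfs:/foldername
dbutils.fs.mkdirs("dbfs:/foldername")

// Copy a file -- CLI equivalent: databricks fs cp --overwrite <src> dbfs:/foldername/test.csv
dbutils.fs.cp("dbfs:/tmp/test.csv", "dbfs:/foldername/test.csv")

// Remove a folder recursively -- CLI equivalent: databricks fs rm -r dbfs:/foldername
dbutils.fs.rm("dbfs:/foldername", true)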

With Scala you can easily pull in the data you store there with code like this:

val df1 = spark
      .read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("dbfs:/foldername/test.csv")
      .select("some_column_name") // placeholder column name

Or read in the whole folder to process all the CSV files available:

val df1 = spark
      .read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("dbfs:/foldername/*.csv")
      .select("some_column_name") // placeholder column name

I think it is easy to use and learn. I hope you find this info helpful!

2 votes

Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters.
DBFS is an abstraction on top of scalable object storage and offers the following benefits:
1) Allows you to mount storage objects so that you can seamlessly access data without requiring credentials (a mount sketch follows below).
2) Allows you to interact with object storage using directory and file semantics instead of storage URLs.
3) Persists files to object storage (Blob), so you won't lose data after you terminate a cluster.

This link will help you get a better understanding of the Databricks utils commands: databricks-file-system link
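As an example of benefit 1, here is a minimal Scala sketch of mounting Azure Blob storage with dbutils and then browsing it with directory semantics (the account, container, mount point, and secret scope/key names are all placeholders):

// Mount a Blob storage container at /mnt/mydata (names are placeholders).
dbutils.fs.mount(
  source = "wasbs://mycontainer@myaccount.blob.core.windows.net",
  mountPoint = "/mnt/mydata",
  extraConfigs = Map(
    "fs.azure.account.key.myaccount.blob.core.windows.net" ->
      dbutils.secrets.get(scope = "myscope", key = "storage-key")))

// Once mounted, the container can be browsed like an ordinary directory.
display(dbutils.fs.ls("/mnt/mydata"))

After mounting, any notebook on a cluster in the workspace can read the same data by path, without handling storage credentials itself.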

1 vote

A few points worth mentioning in addition to the other answers:

  1. AFAIK, you don't pay for storage costs associated with DBFS. Instead, you pay an hourly fee to run jobs on DBX.

  2. Even though the data is stored in Blob/S3 in the cloud, you can't access that storage directly. That means you have to use the DBX APIs or CLI to access this storage (see the sketch after this list).

  3. Which leads to the third and obvious point: using DBFS will more tightly couple your Spark applications to DBX, which may or may not be what you want to do.
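To illustrate point 2: even from outside a cluster, you go through Databricks rather than the storage account itself. Below is a minimal sketch of listing a DBFS folder through the DBFS REST API's /api/2.0/dbfs/list endpoint (the workspace URL, token variable, and path are placeholders):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object DbfsList {
  def main(args: Array[String]): Unit = {
    // Placeholders: substitute your own workspace URL and personal access token.
    val workspace = "https://adb-1234567890123456.7.azuredatabricks.net"
    val token = sys.env("DATABRICKS_TOKEN")

    // GET /api/2.0/dbfs/list returns a JSON listing of the given DBFS path.
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$workspace/api/2.0/dbfs/list?path=/foldername"))
      .header("Authorization", s"Bearer $token")
      .GET()
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}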