I would like to understand the difference between RAM and storage in Azure Databricks.
Suppose I am reading CSV data from Azure Data Lake (ADLS Gen2) as follows:

```
df = spark.read.csv("path to the csv file").collect()
```
I am aware that the `read` method in Spark is a transformation, so it is not run immediately. However, if I now perform an action using the `collect()` method, I would assume the data is actually read from the data lake by Spark and loaded into RAM or onto disk.

First, I would like to know where the data is stored: in RAM or on disk? If the data is stored in RAM, then what is `cache` used for? And if the data is retrieved and stored on disk, then what does `persist` do? I am aware that `cache` stores the data in memory for later use, and that if I have a very large amount of data, I can use `persist` to store the data on disk.

I would also like to know how far Databricks can scale if we have petabytes of data.
- How much do RAM and disk differ in size?
- How can I know where the data is stored at any point in time?
- What is the underlying operating system running Azure Databricks?
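To check my understanding of lazy evaluation, here is a toy pure-Python sketch of how I picture transformations vs. actions. This is not Spark code; `LazyFrame` and its methods are invented names purely for illustration:

```python
# Toy illustration of lazy evaluation (NOT Spark's real implementation;
# LazyFrame and its methods are made-up names for this sketch).
class LazyFrame:
    def __init__(self):
        self.plan = []  # recorded transformations, not yet executed

    def read_csv(self, path):
        # A "transformation": only records the step, reads nothing yet.
        self.plan.append(("read_csv", path))
        return self

    def filter(self, predicate):
        # Another "transformation": also just appended to the plan.
        self.plan.append(("filter", predicate))
        return self

    def collect(self):
        # The "action": only now would the recorded plan actually run.
        return [step[0] for step in self.plan]  # pretend execution


df = LazyFrame().read_csv("some.csv").filter("x > 0")
print(df.plan)       # steps are recorded, but nothing has executed yet
print(df.collect())  # the action triggers execution of the whole plan
```

Is this mental model roughly right, i.e. that `read` only records a step in a plan and `collect()` is what triggers the actual I/O?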
Please note that I am a newbie to Azure Databricks and Spark. I would also like some recommendations on best practices when using Spark.

Your help is much appreciated!
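To make my question about `cache` vs. `persist` concrete: I understand that in PySpark `cache()` is shorthand for `persist()` with a default storage level, and that `persist()` lets you choose a `StorageLevel` (e.g. memory only, or memory and disk). Below is a toy pure-Python sketch of how I picture the difference between keeping a result in RAM and spilling it to disk. `ToyDataset` and its methods are invented for illustration and are not Spark's API:

```python
import os
import pickle
import tempfile

# Toy model of "cache in RAM" vs "persist to disk"
# (made-up names; not Spark's actual API or implementation).
class ToyDataset:
    def __init__(self, compute):
        self.compute = compute   # expensive function producing the data
        self._mem = None         # in-memory copy ("cache")
        self._disk_path = None   # on-disk copy ("persist to disk")

    def cache(self):
        # Keep the computed result in RAM for later reuse.
        self._mem = self.compute()
        return self

    def persist_to_disk(self):
        # Write the computed result to disk instead of holding it in RAM.
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.compute(), f)
        self._disk_path = path
        return self

    def collect(self):
        if self._mem is not None:
            return self._mem                 # fastest: served from RAM
        if self._disk_path is not None:
            with open(self._disk_path, "rb") as f:
                return pickle.load(f)        # slower: read back from disk
        return self.compute()                # otherwise: recompute from source


in_ram = ToyDataset(lambda: list(range(5))).cache()
on_disk = ToyDataset(lambda: list(range(5))).persist_to_disk()
print(in_ram.collect())   # served from the in-memory copy
print(on_disk.collect())  # read back from the temporary file on disk
```

Is this the right way to think about where cached vs. persisted data lives on a Databricks cluster?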