I would like to understand the difference between RAM and storage in Azure Databricks.
Suppose I am reading CSV data from Azure Data Lake (ADLS Gen2) as follows:

```
df = spark.read.csv("path to the csv file").collect()
```
I am aware that the `read` method in Spark is a transformation, so it is not run immediately. However, if I now perform an action using the `collect()` method, I would assume the data is actually read from the data lake by Spark and loaded into RAM or onto disk.

First, I would like to know where the data is stored: in RAM or on disk? If the data is stored in RAM, then what is `cache` used for? And if the data is retrieved and stored on disk, then what does `persist` do? I am aware that `cache` stores the data in memory for later use, and that if I have a very large amount of data, I can use `persist` to store the data on disk.

I would also like to know how far Databricks can scale if we have petabytes of data.
- How much do RAM and disk differ in size?
- How can I know where the data is stored at any point in time?
- What is the underlying operating system running Azure Databricks?
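To check my understanding of lazy evaluation, here is a toy pure-Python sketch of how I picture transformations vs. actions. This is not Spark code; `LazyFrame` and its methods are invented names purely for illustration:

```python
# Toy illustration of lazy evaluation (NOT Spark's real implementation;
# LazyFrame and its methods are made-up names for this sketch).
class LazyFrame:
    def __init__(self):
        self.plan = []  # recorded transformations, not yet executed

    def read_csv(self, path):
        # A "transformation": only records the step, reads nothing yet.
        self.plan.append(("read_csv", path))
        return self

    def filter(self, predicate):
        # Another "transformation": also just appended to the plan.
        self.plan.append(("filter", predicate))
        return self

    def collect(self):
        # The "action": only now would the recorded plan actually run.
        return [step[0] for step in self.plan]  # pretend execution


df = LazyFrame().read_csv("some.csv").filter("x > 0")
print(df.plan)       # steps are recorded, but nothing has executed yet
print(df.collect())  # the action triggers execution of the whole plan
```

Is this mental model roughly right, i.e. that `read` only records a step in a plan and `collect()` is what triggers the actual I/O?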
Please note that I am a newbie to Azure Databricks and Spark. I would also like some recommendations on best practices when using Spark.

Your help is much appreciated!
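To make my question about `cache` vs. `persist` concrete: I understand that in PySpark `cache()` is shorthand for `persist()` with a default storage level, and that `persist()` lets you choose a `StorageLevel` (e.g. memory only, or memory and disk). Below is a toy pure-Python sketch of how I picture the difference between keeping a result in RAM and spilling it to disk. `ToyDataset` and its methods are invented for illustration and are not Spark's API:

```python
import os
import pickle
import tempfile

# Toy model of "cache in RAM" vs "persist to disk"
# (made-up names; not Spark's actual API or implementation).
class ToyDataset:
    def __init__(self, compute):
        self.compute = compute   # expensive function producing the data
        self._mem = None         # in-memory copy ("cache")
        self._disk_path = None   # on-disk copy ("persist to disk")

    def cache(self):
        # Keep the computed result in RAM for later reuse.
        self._mem = self.compute()
        return self

    def persist_to_disk(self):
        # Write the computed result to disk instead of holding it in RAM.
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.compute(), f)
        self._disk_path = path
        return self

    def collect(self):
        if self._mem is not None:
            return self._mem                 # fastest: served from RAM
        if self._disk_path is not None:
            with open(self._disk_path, "rb") as f:
                return pickle.load(f)        # slower: read back from disk
        return self.compute()                # otherwise: recompute from source


in_ram = ToyDataset(lambda: list(range(5))).cache()
on_disk = ToyDataset(lambda: list(range(5))).persist_to_disk()
print(in_ram.collect())   # served from the in-memory copy
print(on_disk.collect())  # read back from the temporary file on disk
```

Is this the right way to think about where cached vs. persisted data lives on a Databricks cluster?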