0
votes

I am using Azure Databricks with latest runtime for the clusters. I had some confusion regarding VACUUM operation in delta lake. We know we can set a retention duration on the deleted data, however, for actual data to be delete after the retention period is over, do we need to keep the Cluster Up for the entire duration?

In simple words-: Do we need to have Cluster always in running state in order to leverage Delta lake ?

2

2 Answers

1
votes

You don't need to always keep a cluster up and running. You can schedule a vacuum job to run daily (or weekly) to clean up stale data older than the threshold. Delta Lake doesn't require an always-on cluster. All the data/metadata are stored in the storage (s3/adls/abfs/hdfs), so no need to keep anything up and running.

-2
votes

Apparently you need a cluster to be up and running always to query for the data available in databricks tables.

If you have configured the external meta store for databricks, then you can use any wrappers like apache hive by pointing it to that external meta store DB and query the data using hive layer without using databricks.