
We have an Azure Data Lake that stores data as Parquet files in Delta Lake format. After every run, once the new data is merged, we call VACUUM with a 0-hour retention to remove the old files and then run the OPTIMIZE command.

But for some reason the old files are not being deleted. There are no errors in the Databricks notebook; it reports that 2 files were removed, yet I can still see them. Am I missing something obvious? Thanks!

sqlContext.sql(f"VACUUM  '{adls_location}' RETAIN 0 HOURS")
time.sleep(60)
sqlContext.sql(f"VACUUM  '{adls_location}' RETAIN 0 HOURS")
time.sleep(60)
sqlContext.sql(f"OPTIMIZE '{adls_location}'")

1 Answer


You cannot run VACUUM directly against a cloud storage path. To vacuum the storage, you must mount it to DBFS and run VACUUM on the mounted directory.
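
A minimal sketch of that approach, assuming an ADLS Gen2 account accessed through a service principal; the storage account, container, secret scope, mount point, and table path below are all placeholders:

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the ADLS Gen2 container to DBFS (only needs to run once per workspace).
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)

# A 0-hour retention is below Delta's 7-day default, so the safety check must be disabled first.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Point VACUUM (and OPTIMIZE) at the mounted path instead of the abfss:// URL.
spark.sql("VACUUM '/mnt/datalake/<table-path>' RETAIN 0 HOURS")
spark.sql("OPTIMIZE '/mnt/datalake/<table-path>'")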

Reference: Azure Databricks - Vacuum