I am currently exploring Delta Lake, which is open sourced by Databricks. I am reading Kafka data and writing it out as a stream in the Delta Lake format. The streaming write from Kafka creates many small files, which I feel hurts the HDFS file system.
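The streaming write looks roughly like this (a minimal sketch; the bootstrap servers, topic name, and checkpoint path are placeholders for my actual values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("kafka to delta")
  .getOrCreate()

// Read the Kafka topic as a stream (placeholder servers/topic)
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Every micro-batch commits new small parquet files to the Delta table
kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "deltalakefile/checkpoint/")
  .start("deltalakefile/data/")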
I have tried the following to compact the many files into a single file:
import org.apache.spark.sql.SparkSession
import io.delta.tables.DeltaTable

val spark = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

// Read the current state of the Delta table and rewrite it as a single file
val df = spark.read.format("delta").load("deltalakefile/data/")
df.repartition(1).write.format("delta").mode("overwrite").save("deltalakefile/data/")
df.show()

// Disable the default 7-day retention check, then vacuum with a 1-hour retention
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
DeltaTable.forPath("deltalakefile/data/").vacuum(1)
But when I checked the output, it created a new file without removing any of the existing files.
Is there a way to achieve this? Also, what role does the retention period play here? How should it be configured when running on HDFS? And what should my retention configuration be when I want to build a raw/bronze layer in the Delta Lake format and preserve all of my data for a long period (years on premises, indefinitely in the cloud)?
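To make the last question concrete, I understand retention is controlled by the table properties delta.logRetentionDuration and delta.deletedFileRetentionDuration, which I assume would be set roughly like this (a sketch only; the interval values are placeholders, not a recommendation, and the SQL syntax assumes a Delta Lake version with SQL support):

// Sketch: setting retention properties on an existing path-based Delta table
// (placeholder values; not a recommendation)
spark.sql("""
  ALTER TABLE delta.`deltalakefile/data/`
  SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 365 days',
    'delta.deletedFileRetentionDuration' = 'interval 365 days'
  )
""")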