I just started using Delta Lake, so my mental model might be off - I'm asking this question to validate/refute it.
My understanding of Delta Lake is that it only stores incremental changes to data (the "deltas"). Kind of like git: every time you make a commit, you're not storing an entire snapshot of the codebase - a commit only contains the changes you made. Similarly, I would imagine that if I create a Delta table and then "update" the table with exactly what it already contains (i.e. an "empty commit"), I would not expect any new data to be created as a result of that update.
However, this is not what I observe: such an update appears to duplicate the existing table. What's going on? That doesn't seem very "incremental" to me.
(I've replaced the actual UUID values in the filenames for readability.)
# create 200 sample records (Hudi's QuickstartUtils is used here purely as a
# convenient data generator; the table itself is written with Delta)
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(200))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

# write it out as a Delta table
delta_base_path = "s3://my-delta-table/test"
(df.write.format("delta")
    .mode("overwrite")
    .save(delta_base_path))
!aws s3 ls --recursive s3://my-delta-table/test
2020-10-19 14:33:21 1622 test/_delta_log/00000000000000000000.json
2020-10-19 14:33:18 0 test/_delta_log_$folder$
2020-10-19 14:33:19 10790 test/part-00000-UUID1-c000.snappy.parquet
2020-10-19 14:33:19 10795 test/part-00001-UUID2-c000.snappy.parquet
Then I overwrite using the exact same dataset:
(df.write.format("delta")
    .mode("overwrite")
    .save(delta_base_path))
!aws s3 ls --recursive s3://my-delta-table/test
2020-10-19 14:53:02 1620 test/_delta_log/00000000000000000000.json
2020-10-19 14:53:08 872 test/_delta_log/00000000000000000001.json
2020-10-19 14:53:01 0 test/_delta_log_$folder$
2020-10-19 14:53:07 6906 test/part-00000-UUID1-c000.snappy.parquet
2020-10-19 14:53:01 6906 test/part-00000-UUID3-c000.snappy.parquet
2020-10-19 14:53:07 6956 test/part-00001-UUID2-c000.snappy.parquet
2020-10-19 14:53:01 6956 test/part-00001-UUID4-c000.snappy.parquet
Now there are TWO copies each of part-00000 and part-00001, and within each pair the files are exactly the same size. Why doesn't Delta dedupe the existing data?
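For what it's worth, each `_delta_log/*.json` commit file is newline-delimited JSON, one action object per line, so the second commit can be inspected directly. A minimal sketch (assuming the commit file has been copied locally first, e.g. with `aws s3 cp`; the function name is mine) that counts the `add` and `remove` actions in a commit:

```python
import json

def summarize_commit(path):
    """Count the 'add' and 'remove' actions in a Delta commit file.

    Each line of the commit file is a JSON object whose top-level key
    ('add', 'remove', 'commitInfo', ...) identifies the action type.
    """
    counts = {"add": 0, "remove": 0}
    with open(path) as f:
        for line in f:
            action = json.loads(line)
            for kind in counts:
                if kind in action:
                    counts[kind] += 1
    return counts

# e.g. summarize_commit("00000000000000000001.json")
```

If the second commit shows both `add` and `remove` actions, that would at least tell me the old files are being logically replaced in the log rather than physically deleted from S3.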