
We load source files every day into a Delta Lake table (table name: dst1) as time series data. If the Delta table does not exist, we create it; if it does exist, we merge the new data into it as time series data.

Creating the table if it does not exist:
spark.sql(f"CREATE TABLE {dtable} USING DELTA LOCATION '{dmount1}'")
 

If the table does exist, we merge the new data as time series data, where:
     df     --> has data for day 1
     dtable --> has data for day 0
     jkey   --> d.id = ds.id

spark.sql(f"MERGE INTO {dtable} d USING df ds ON {jkey} WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *")

Data from source on day 1:

id  lname   fname   gender  satscore
1   Steve   Jobs    M   1100
2   Mapple  Syrup   F   1300
4   Tom     Cruise  M   1200
6   New     Record  F   1320

Data from source on day 2:

id  lname   fname   gender  satscore
1   Steve   Not     M   2000

select * from dst1 VERSION AS OF 0
id  lname   fname   gender  satscore
1   Steve   Jobs    M   1100
2   Mapple  Syrup   F   1300
4   Tom     Cruise  M   1200
6   New     Record  F   1320

select * from dst1 VERSION AS OF 1
id  lname   fname   gender  satscore
1   Steve   Not     M   2000
2   Mapple  Syrup   F   1300
4   Tom     Cruise  M   1200
6   New     Record  F   1320
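The same snapshots can also be read from PySpark instead of SQL; a minimal sketch, assuming the table data lives at the dmount1 path used above:

# Sketch: time-travel reads of the two versions shown above, by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(dmount1)
v1 = spark.read.format("delta").option("versionAsOf", 1).load(dmount1)
v0.show()
v1.show()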

The day 2 source does not contain ids 2, 4, and 6, yet I still see them in version 1 of the Delta table. I was expecting version 1 to contain only id 1. How can I achieve this?

https://docs.databricks.com/delta/delta-update.html#language-sql