
For one of my use cases I am using the change data feed (CDF) feature of Delta Lake. CDF itself works well, but when I read all the data to insert into the gold table, it returns changes from all versions. Is there a way to read only the latest version without specifying a version number, or a way to fetch the latest version?

        # valid _change_type values are insert, delete, update_preimage, update_postimage
        # requires: from pyspark.sql.functions import col
        return spark.read.format("delta") \
                  .option("readChangeFeed", "true") \
                  .table(tableName) \
                  .where(col("_change_type") != "update_preimage")

The code block above returns results from all versions since the start. I can fetch only the latest data by looking at the table history and specifying the version (sketched below), but I don't understand how to enable this in production. I don't want to use a timestamp to fetch the latest version because, in case of retries, someone might run the pipeline multiple times a day, and this will introduce data inaccuracies if the rerun is not handled the same way as the first run of the day. Any help would be appreciated.
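For reference, this is roughly how I look up the latest version manually today; a minimal sketch, assuming the delta-spark package is installed and tableName is already defined:

    from delta.tables import DeltaTable

    # history(1) returns only the most recent commit of the Delta table
    latest_version = DeltaTable.forName(spark, tableName) \
        .history(1) \
        .select("version") \
        .collect()[0][0]

    # batch CDF read limited to changes from that version onward
    latest_changes = spark.read.format("delta") \
        .option("readChangeFeed", "true") \
        .option("startingVersion", latest_version) \
        .table(tableName) \
        .where("_change_type != 'update_preimage'")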


1 Answer


Change data feed records row-level modifications, so a query against it returns the different versions of a Delta table.

A similar SO question was resolved by @Tim / @Alex Ott.

Read the table as a stream, using syntax like the example below from the above SO:

# streaming CDF read starting from the latest version, dropping pre-update images
(spark.readStream
      .format("delta")
      .option("readChangeFeed", "true")
      .option("startingVersion", "latest")
      .table(tableName)
      .filter("_change_type != 'update_preimage'"))