3 votes

I am accessing Databricks Delta tables from Azure Data Factory, which does not have a native connector to Databricks tables. So, as a workaround, I create the tables with the LOCATION keyword to store them in Azure Data Lake. Then, since I know the table file location, I just read the underlying Parquet files from Data Factory. This works fine.
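
Roughly what I am doing, for context (the table name, schema, and storage path below are placeholders, not my real ones):

    # Run in a Databricks notebook, where `spark` is the active SparkSession.
    table_path = "abfss://lake@mystorage.dfs.core.windows.net/delta/events"

    # Create the Delta table at an explicit ADLS location so the files are easy to find.
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS events (id BIGINT, value STRING)
        USING DELTA
        LOCATION '{table_path}'
    """)

    # Data Factory then reads the Parquet files under `table_path` directly,
    # bypassing the Delta transaction log (_delta_log) entirely.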

But... what if there is cached information in the Delta transaction log that has not yet been written to disk? Say, an application updated a row in the table, and the disk does not yet reflect this fact. Then my read from Data Factory will be wrong.

So, two questions...

  • Could this happen? Are changes held in the log for a while before being written out?
  • Can I force a transaction log flush, so I know the disk copy is updated?

2 Answers

1 vote

Similar questions have been asked on this topic (see here for an example).

To read Delta tables correctly you need a reader with Delta Lake support, because the Delta transaction log, not the set of Parquet files on disk, captures the true state of the table. So as of now you have to use a Databricks activity in Azure Data Factory for further processing of Delta tables (or replicate the datasets to plain Parquet so the data is consumable by services that do not support Delta Lake yet). In theory you could run VACUUM with a retention period of 0 so that only the current files remain on disk, but this is not recommended and can cause data inconsistency.
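
To make that concrete, here is a rough sketch (table name, schema, and paths are placeholders, and `spark` is the notebook's SparkSession). An UPDATE rewrites the affected rows into new Parquet files and only logically removes the old files in the transaction log, so a raw Parquet read sees both old and new copies of the rows; replicating the table to plain Parquet gives Data Factory a consistent snapshot instead:

    # Illustrative sketch on Databricks; names and paths are placeholders.
    delta_path = "abfss://lake@mystorage.dfs.core.windows.net/delta/events"
    parquet_copy = "abfss://lake@mystorage.dfs.core.windows.net/export/events"

    # The UPDATE writes new Parquet files and records the old ones as removed in
    # _delta_log; the old files stay on disk until VACUUM, so reading the Parquet
    # files directly returns both the old and new versions of the updated rows.
    spark.sql(f"UPDATE delta.`{delta_path}` SET value = 'updated' WHERE id = 42")

    # Replicate a consistent snapshot of the Delta table to plain Parquet for
    # Data Factory (or any non-Delta reader) to consume.
    (spark.read.format("delta").load(delta_path)
         .write.mode("overwrite")
         .parquet(parquet_copy))

    # (VACUUM with RETAIN 0 HOURS would delete the superseded files, but it requires
    # disabling Delta's retention safety check and can break concurrent readers,
    # which is why it is not recommended.)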

According to the Azure Feedback forum, native support for this is planned for the future.

0 votes

ADF supports the Delta Lake format as of July 2020:

https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-connectors-for-delta-lake-and-excel/ba-p/1515793

"The Microsoft Azure Data Factory team is enabling … and a data flow connector for data transformation using Delta Lake"

Delta is currently available in ADF as a public preview in data flows as an inline dataset.