0 votes

To be clear about the format, this is how the DataFrame is saved in Databricks:

folderpath = "abfss://container@storageaccount.dfs.core.windows.net/folder/path"
df.write.format("delta").mode("overwrite").save(folderpath)

This produces a set of Parquet files (often in 2-4 chunks) in the main folder, along with a _delta_log folder containing JSON files that describe each write. The delta log dictates which set of Parquet files in the folder should be read.
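
For illustration, the folder typically ends up looking something like this (the part-* file names are generated by Spark and will differ; this is just an example layout):

folder/path/
    _delta_log/
        00000000000000000000.json
        00000000000000000001.json
    part-00000-xxxx.c000.snappy.parquet
    part-00001-xxxx.c000.snappy.parquet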

In Databricks, I would read the latest dataset, for example, by doing the following:

df = spark.read.format("delta").load(folderpath)
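
By default this loads the latest version of the table; an older version can be loaded with Delta's time travel option, e.g.:

df = spark.read.format("delta").option("versionAsOf", 0).load(folderpath)

The point being that the version resolution happens through the _delta_log, not through the Parquet files themselves.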

How would I do this in Azure Data Factory? I have chosen Azure Data Lake Storage Gen2 with the Parquet format, but this doesn't seem to work: the entire set of Parquet files is read (i.e. all versions of the data), not just the latest.

How can I set this up properly?

What's the latest dataset? Is this the latest modified file? - Leon Yue
That would be whatever the latest run of Databricks produces and saves to the storage. - user3012708
Hi @user3012708, another question: do you want to achieve that with pipeline activities? Because Data Factory supports running Databricks scripts with Python or a notebook. In my experience, it's very hard to achieve that with activities alone. - Leon Yue
Has to be with the pipeline, can't just run the notebook or something, sadly. Effectively I'd be reading the data from Azure Data Lake Storage Gen2, but I only want the latest data from there. This is defined by the _delta_log files somehow, but I don't know how ADF will read them, since it seems to read all the Parquet files together. - user3012708
I shared my ideas in the answer, hope it's helpful for you. - Leon Yue

1 Answer

1 vote

With a Data Factory pipeline, this seems hard to achieve. But I have some ideas for you:

  1. Use a Lookup activity to get the content of the _delta_log files. If there are many files, use a Get Metadata activity to get each file's metadata (last modified date).

  2. Use an If Condition or Switch activity to filter for the latest data.

  3. After the data is filtered, pass the Lookup output to set the Copy activity source (set it as a parameter).

The hardest part is figuring out how to identify the latest dataset from the _delta_log. You could try it this way; the whole workflow should look like the above, but I can't tell you whether it really works, as I couldn't test it without the same environment. A sketch of the log-replay logic the filter would need to reproduce is below.
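
To make that concrete, here is a minimal Python sketch of that logic. It assumes the standard Delta log layout (each numbered commit file in _delta_log holds one JSON action per line) and ignores checkpoint files, so treat it as a starting point rather than a complete implementation:

import json
import os

def active_parquet_files(table_path):
    # Replay the Delta transaction log to find the Parquet files
    # that make up the current (latest) version of the table.
    log_dir = os.path.join(table_path, "_delta_log")
    # Commit files have zero-padded names (e.g. 00000000000000000001.json),
    # so a lexicographic sort puts them in version order.
    commits = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
    active = set()
    for commit in commits:
        with open(os.path.join(log_dir, commit)) as fh:
            for line in fh:
                if not line.strip():
                    continue
                action = json.loads(line)
                # An overwrite commit removes the previous files and adds
                # new ones, so after replaying every commit only the
                # latest write's files remain in the set.
                if "add" in action:
                    active.add(action["add"]["path"])
                elif "remove" in action:
                    active.discard(action["remove"]["path"])
    return sorted(active)

The paths returned by this replay are exactly the Parquet files the Copy activity source would need to be restricted to.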

HTH.