0
votes

I am trying to create an external BQ external table on top of delta table which uses google storage as a storage layer. On the delta table, we perform DML which includes deletes.

I can create an BQ external table on top of the gs bucket where all the delta files exists. However, it is pulling even the delete records since BQ external table cannot read the transaction logs of delta where it says which parquet files to consider and which one to remove.

is there a way we can expose the latest snapshot of the delta table (gs location) in BQ as an external table apart from doing it programmatically copying data from delta to BQ?

1
Do you have a naming convention with your file? What is the frequency of your file update? - guillaume blaquiere

1 Answers

0
votes

I'm also developing this kind of pipeline where we dump our delta lake files in GCS and present it on Bigquery. The generation of manifest file from your GCS delta files will give you the latest snapshot based on what version current set on your delta files. Then you need to create a custom script to parse that manifest file to get the list of files and then run a bq load mentioning those files.

val deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")