I am currently employed as a Junior Data Developer and recently saw a post saying that Azure Synapse can now create SQL tables from Delta tables. I tried creating an SQL table from a Delta table which is inside a Delta lake Storage V2, but the table is being populated with extra redundant data (all the data from all snapshots in the folder) when using 'PARQUET' as a file format and wildcard to read the files.
I tried creating an external file format for my table but Synapse doesn't accept 'DELTA' as a datatype. I used 'PARQUET' as a file format and used VACUUM on my Delta table to keep only the latest snapshot of it. Whenever I set the path to a specific file or once there was only a single snappy.parquet file in the Delta table, data was printed properly.
Basically is there any way to create a Synapse Table/External Table that get it's data from a Delta table?If not is there any way to stop Azure Deltalake from creating a new snapshot every time new data is written/updated/deleted?
Script used:
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = SynapseParquetFormat')
CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat]
WITH ( FORMAT_TYPE = PARQUET)
GO
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'ExtSource')
CREATE EXTERNAL DATA SOURCE [ExtSource]
WITH (
LOCATION = '*',
)
GO
CREATE EXTERNAL TABLE dbo.ext_table (
[CostCentre] varchar(8000),
[CostCentre_MemberId] int
)
WITH (
LOCATION = 'dimensions/Dim_Example/*.snappy.parquet',
-- WILDCARD IF THERE IS ONLY ONE FILE OR LATEST FILE NEEDS TO BE SPECIFIED
DATA_SOURCE = [ExtSource],
FILE_FORMAT = [SynapseParquetFormat]
)
GO
/* '*' used because clients' data paths were used */