
We load data from on-prem database servers into Azure Data Lake Storage Gen2 using Azure Data Factory, and Databricks stores it as parquet files. On every run, we pick up only the new and modified data since the last run and UPSERT it into the existing parquet files using the Databricks MERGE statement.
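
For reference, the Databricks upsert step looks something like the sketch below. Note that MERGE is a Delta Lake operation, so the parquet files would be registered as a Delta table; the table, view, and column names here are illustrative.

-- Hypothetical Delta Lake upsert in Databricks; incremental_updates holds
-- only the new and modified rows captured since the last run
MERGE INTO lake.DimProduct AS t
USING incremental_updates AS s
    ON t.ProductKey = s.ProductKey
WHEN MATCHED THEN UPDATE SET *       -- overwrite changed rows
WHEN NOT MATCHED THEN INSERT *;      -- add brand-new rows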

Now we are trying to move this data from the parquet files into Azure Synapse. Ideally, I would like to do the following:

  • Read the incremental load data into an external table (CETAS or COPY INTO); a sketch of this follows the list.
  • Use the above as a staging table.
  • Merge the staging table into the production table.
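
For the first two bullets, a COPY INTO sketch for landing the incremental parquet files in a staging table might look like this; the storage path and authentication method are placeholders, and the staging table is assumed to already exist with a matching schema.

-- Load only the incremental parquet files into the staging table
COPY INTO dbo.[stg_DimProduct]
FROM 'https://<storageaccount>.dfs.core.windows.net/<container>/incremental/DimProduct/*.parquet'
WITH
(   FILE_TYPE  = 'PARQUET'
,   CREDENTIAL = (IDENTITY = 'Managed Identity')
);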

The problem is that the MERGE statement is not available in Azure Synapse. Here is the solution Microsoft suggests for incremental loads:

CREATE TABLE dbo.[DimProduct_upsert]
WITH
(   DISTRIBUTION = HASH([ProductKey])
,   CLUSTERED INDEX ([ProductKey])
)
AS
-- New rows and new versions of rows
SELECT      s.[ProductKey]
,           s.[EnglishProductName]
,           s.[Color]
FROM      dbo.[stg_DimProduct] AS s
UNION ALL  
-- Keep rows that are not being touched
SELECT      p.[ProductKey]
,           p.[EnglishProductName]
,           p.[Color]
FROM      dbo.[DimProduct] AS p
WHERE NOT EXISTS
(   SELECT  *
    FROM    [dbo].[stg_DimProduct] s
    WHERE   s.[ProductKey] = p.[ProductKey]
)
;

RENAME OBJECT dbo.[DimProduct]          TO [DimProduct_old];
RENAME OBJECT dbo.[DimProduct_upsert]  TO [DimProduct];

Basically, this drops and re-creates the production table with CTAS. That will work fine for small dimension tables, but I'm apprehensive about large fact tables with hundreds of millions of rows and indexes. Any suggestions on the best way to do incremental loads for really large fact tables? Thanks!


1 Answer


Until SQL MERGE is officially supported, the recommended way forward to update target tables is to use T-SQL INSERT/UPDATE commands between the delta records and the target table.
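
A minimal sketch of that pattern, reusing the stg_DimProduct and DimProduct names from the question (the column list is illustrative). Dedicated SQL pools don't support ANSI joins in UPDATE statements, hence the implicit join below.

-- Update rows that already exist in the target
UPDATE  dbo.[DimProduct]
SET     [EnglishProductName] = s.[EnglishProductName]
,       [Color]              = s.[Color]
FROM    dbo.[stg_DimProduct] AS s
WHERE   s.[ProductKey] = dbo.[DimProduct].[ProductKey];

-- Insert rows from staging that do not exist in the target yet
INSERT INTO dbo.[DimProduct] ([ProductKey], [EnglishProductName], [Color])
SELECT  s.[ProductKey]
,       s.[EnglishProductName]
,       s.[Color]
FROM    dbo.[stg_DimProduct] AS s
WHERE NOT EXISTS
(   SELECT  1
    FROM    dbo.[DimProduct] p
    WHERE   p.[ProductKey] = s.[ProductKey]
);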

Alternatively, you can also use Mapping Data Flows (in ADF) to emulate SCD transactions for dimensional/fact data loads.