The question looks fairly simple, but in my case there are some intricacies involved. So here is the deal:
- There is one SQL Server database hosted on-premises and one Azure SQL database. We need to keep both databases in sync. Each database contains 50 tables.
The Azure database will not be updated by any application, but the on-premises database is updated frequently. So modified/inserted data from the on-premises database needs to be moved to the Azure database. We are using Azure Data Factory (ADF) for this.
All tables contain a column called LastModifiedDate indicating when the record was last modified.
- Currently we have created a staging table corresponding to each of the 50 tables. We also maintain one watermark table which contains the table name and its highest synced LastModifiedDate.
- One activity in the ADF pipeline executes a stored procedure which takes records from every table having LastModifiedDate greater than the corresponding LastModifiedDate in the watermark table and dumps them into the staging tables (see the first sketch after this list).
- When this stored procedure completes, all data from the staging tables is synced to the Azure database tables. Finally, the watermark table's LastModifiedDate is updated for each table, and all staging tables are then flushed (see the second sketch after this list).
- This process repeats periodically, so that whenever data is updated on-premises, the Azure database is updated as well.
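
To make the current setup concrete, this is roughly what the watermark table and the per-table extract inside the stored procedure look like (schema, table and column names such as `Customer` and `stg.Customer` are just examples, not our real names):

```sql
-- Watermark table: one row per source table, tracking the last synced timestamp
CREATE TABLE dbo.WatermarkTable
(
    TableName        SYSNAME      NOT NULL PRIMARY KEY,
    LastModifiedDate DATETIME2(3) NOT NULL
);

-- Inside the stored procedure called by the ADF activity, each of the 50 tables
-- gets a block like this (Customer / stg.Customer are example names):
INSERT INTO stg.Customer (CustomerId, Name, LastModifiedDate)
SELECT c.CustomerId, c.Name, c.LastModifiedDate
FROM   dbo.Customer AS c
JOIN   dbo.WatermarkTable AS w
       ON w.TableName = 'Customer'
WHERE  c.LastModifiedDate > w.LastModifiedDate;
```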
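
And this is roughly the effect of the second step: the staged rows are applied to the corresponding Azure table (this step is driven from ADF, so the MERGE below is only an illustration of what happens), then the watermark is advanced and staging is flushed (again example names only):

```sql
-- Upsert the staged rows into the target table (one such block per table)
MERGE dbo.Customer AS target
USING stg.Customer AS source
    ON target.CustomerId = source.CustomerId
WHEN MATCHED THEN
    UPDATE SET target.Name = source.Name,
               target.LastModifiedDate = source.LastModifiedDate
WHEN NOT MATCHED THEN
    INSERT (CustomerId, Name, LastModifiedDate)
    VALUES (source.CustomerId, source.Name, source.LastModifiedDate);

-- Advance the watermark to the highest timestamp just synced, then flush staging
UPDATE dbo.WatermarkTable
SET    LastModifiedDate = (SELECT MAX(LastModifiedDate) FROM stg.Customer)
WHERE  TableName = 'Customer'
  AND  EXISTS (SELECT 1 FROM stg.Customer);

TRUNCATE TABLE stg.Customer;
```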
Problems with the current approach:
Creating a staging table corresponding to each source table doesn't look like a good idea. If the number of tables in the database grows, we need that many more staging tables.
Question:
Is there a better approach to handle this scenario using ADF, without creating a huge number of staging tables?