1
votes

Let's say we have a dimensional data model in our Snowflake data warehouse that is loaded once a day. The requirement is that the data loaded today should become visible to end users only when ALL objects have been loaded successfully.

So if dimension d1 has been loaded successfully today, but the fact f1 load fails for whatever reason, then the end users should keep seeing the data loaded yesterday (also for dimension d1). The switch to the new data should take place only when all dimensional model objects have been loaded successfully.

Currently we do it like this: we stage the incremental/new data to staging tables, and once it is all there successfully, a huge SQL script does a DELETE/INSERT for one dimensional model table at a time, with a single commit at the end. So the final data load is not done in parallel.
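Roughly, with hypothetical names (D1/F1 for the model tables, STG_D1/STG_F1 for the staging tables), the script looks like this:

```sql
BEGIN;                                    -- one explicit transaction for the whole load
DELETE FROM D1;                           -- or only the rows being replaced
INSERT INTO D1 SELECT * FROM STG_D1;
DELETE FROM F1;
INSERT INTO F1 SELECT * FROM STG_F1;
COMMIT;                                   -- single commit at the end: readers see either
                                          -- yesterday's data or today's, never a mix
```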

Is there a better way to achieve the requirement? We were thinking about materialized views (load all the underlying dimensional model tables and then refresh the "select * from table" materialized views), but they are refreshed automatically in Snowflake. A DROP of the target table followed by a CLONE from source to target does not suit us either, as there is a chance that the target table is queried at the moment it does not exist.

What is orchestrating your ELT job? If your fact table load fails, then the orchestration tool should be able to detect this error, and you can then decide to remove all records in your dimension with a particular date. – Simon D
@user12761950 you can suspend a materialized view. – PIG
@PIG The suspend feature does not work for us, as the view cannot be used while suspended. But we need the previous data to be available until the new data has been loaded. – user12761950

2 Answers

1
votes

OK, I think I may have understood your problem statement, but let me rephrase it so that I am sure. You have a number of dimension tables and a fact table. For example:

You have D1, D2, D3 as dimension tables and a fact table F1. Your daily ETL process loads F1, D1, D2 and D3, but you want D1, D2, D3 to show the new data only when the F1 load is successful, and you do not want users to be affected during the load.

If this is correct, one approach would be as below (this approach has the disadvantage of an additional storage cost for the duration of the ETL):

When your ETL starts, clone new tables F1', D1', D2', D3' from the respective F1, D1, D2, D3 tables. Load the incremental data into F1', D1', D2' and D3' first. If the load is successful for all of these tables, do an ALTER TABLE ... SWAP WITH so that F1' becomes F1, D1' becomes D1, D2' becomes D2 and D3' becomes D3. After that you can truncate F1', D1', D2', D3' and repeat this process every day; see the sketch below.
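A minimal sketch of that clone/load/swap cycle, assuming the primed names are replaced with valid identifiers such as F1_STAGE (an apostrophe is not allowed in an unquoted Snowflake identifier):

```sql
-- 1. At the start of the ETL, take zero-copy clones of the live tables
CREATE OR REPLACE TABLE F1_STAGE CLONE F1;
CREATE OR REPLACE TABLE D1_STAGE CLONE D1;
CREATE OR REPLACE TABLE D2_STAGE CLONE D2;
CREATE OR REPLACE TABLE D3_STAGE CLONE D3;

-- 2. Load the incremental data into the *_STAGE tables here ...

-- 3. Only when every load has succeeded, swap the clones in.
--    SWAP WITH exchanges the names of the two tables in one statement.
ALTER TABLE F1_STAGE SWAP WITH F1;
ALTER TABLE D1_STAGE SWAP WITH D1;
ALTER TABLE D2_STAGE SWAP WITH D2;
ALTER TABLE D3_STAGE SWAP WITH D3;
```

Note that each SWAP WITH is atomic on its own, so readers never see a missing table; only the short span across the four swap statements could expose a mix of old and new data.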

0
votes

I do not understand why you eliminated the CLONE option. You do not need to clone each object; you can clone the entire database or schema with one command and make that clone available to your users. The next day, when all data is loaded, you re-clone the objects. It seems like an ideal solution for what you want to achieve.

https://community.snowflake.com/s/article/cloning-in-snowflake
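For example, a minimal sketch assuming the ETL loads into a schema called DWH while users query a schema called REPORTING (both names hypothetical):

```sql
-- Run only after every object in DWH has loaded successfully.
-- CREATE OR REPLACE drops the previous user-facing clone and replaces it
-- in a single statement; until it runs, users keep seeing yesterday's clone.
CREATE OR REPLACE SCHEMA REPORTING CLONE DWH;
```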

The CLONE operation does not copy data; it is a metadata operation. Therefore it will not occupy extra space unless the cloned data is removed from the original/source tables, or you add new data to the cloned objects:

https://docs.snowflake.com/en/user-guide/tables-storage-considerations.html#cloning-tables-schemas-and-databases