1
votes

Let's say we have a dimensional data model in our Snowflake data warehouse that is loaded once a day. The requirement is that the data loaded today should become visible to end users only when ALL objects have been loaded successfully.

So if dimension d1 has been loaded successfully today, but the fact f1 load fails for whatever reason, then the end users should keep seeing the data loaded yesterday (also for dimension d1). The switch to the new data should take place only when all dimensional model objects have been loaded successfully.

Currently we do it like this: we stage the incremental/new data to staging tables, and once it is all there successfully, a huge SQL script does a DELETE/INSERT for one dimensional model table at a time, with a single commit at the end. So the final data load is not done in parallel.
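Roughly, with hypothetical names (D1/F1 for the model tables, STG_D1/STG_F1 for the staging tables), the script looks like this:

```sql
BEGIN;                                    -- one explicit transaction for the whole load
DELETE FROM D1;                           -- or only the rows being replaced
INSERT INTO D1 SELECT * FROM STG_D1;
DELETE FROM F1;
INSERT INTO F1 SELECT * FROM STG_F1;
COMMIT;                                   -- single commit at the end: readers see either
                                          -- yesterday's data or today's, never a mix
```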

Is there a better way to achieve the requirement? We were thinking about materialized views (load all the underlying dimensional model tables and then refresh the "select * from table" materialized views), but they are refreshed automatically in Snowflake. A DROP of the target table followed by a CLONE from source to target does not suit us either, as there is a chance that the target table is queried at the moment it does not exist.

What is orchestrating your ELT job? If your fact table load fails, then the orchestration tool should be able to detect this error, and you can then decide to remove all records in your dimension with a particular date. – Simon D
@user12761950 you can suspend a materialized view. – PIG
@PIG The suspend feature does not work for us, as the view cannot be used while suspended. But we need the previous data to be available until the new data has been loaded. – user12761950

2 Answers

1
votes

OK, I think I may have understood your problem statement, but let me rephrase it so that I am sure. You have a number of dimension tables and a fact table. For example:

You have D1, D2, D3 as dimension tables and a fact table F1. Your daily ETL process loads F1, D1, D2 and D3, but you want D1, D2, D3 to show the new data only when the F1 load is successful, and you do not want users to be affected during the load.

If this is correct, one approach would be as below (this approach has the disadvantage of an additional storage cost for the duration of the ETL):

When your ETL starts, clone new tables F1', D1', D2', D3' from the respective F1, D1, D2, D3 tables. Load the incremental data into F1', D1', D2' and D3' first. If the load is successful for all of these tables, do an ALTER TABLE ... SWAP WITH so that F1' becomes F1, D1' becomes D1, D2' becomes D2 and D3' becomes D3. After that you can truncate F1', D1', D2', D3' and repeat this process every day; see the sketch below.
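A minimal sketch of that clone/load/swap cycle, assuming the primed names are replaced with valid identifiers such as F1_STAGE (an apostrophe is not allowed in an unquoted Snowflake identifier):

```sql
-- 1. At the start of the ETL, take zero-copy clones of the live tables
CREATE OR REPLACE TABLE F1_STAGE CLONE F1;
CREATE OR REPLACE TABLE D1_STAGE CLONE D1;
CREATE OR REPLACE TABLE D2_STAGE CLONE D2;
CREATE OR REPLACE TABLE D3_STAGE CLONE D3;

-- 2. Load the incremental data into the *_STAGE tables here ...

-- 3. Only when every load has succeeded, swap the clones in.
--    SWAP WITH exchanges the names of the two tables in one statement.
ALTER TABLE F1_STAGE SWAP WITH F1;
ALTER TABLE D1_STAGE SWAP WITH D1;
ALTER TABLE D2_STAGE SWAP WITH D2;
ALTER TABLE D3_STAGE SWAP WITH D3;
```

Note that each SWAP WITH is atomic on its own, so readers never see a missing table; only the short span across the four swap statements could expose a mix of old and new data.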

0
votes

I do not understand why you eliminated the CLONE option. You do not need to clone each object; you can clone the entire database or schema with one command and make that clone available to your users. The next day, when all data is loaded, you re-clone the objects. It seems like an ideal solution for what you want to achieve.

https://community.snowflake.com/s/article/cloning-in-snowflake
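For example, a minimal sketch assuming the ETL loads into a schema called DWH while users query a schema called REPORTING (both names hypothetical):

```sql
-- Run only after every object in DWH has loaded successfully.
-- CREATE OR REPLACE drops the previous user-facing clone and replaces it
-- in a single statement; until it runs, users keep seeing yesterday's clone.
CREATE OR REPLACE SCHEMA REPORTING CLONE DWH;
```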

The CLONE operation does not copy data; it is a metadata operation. Therefore it will not occupy extra space unless the cloned data is removed from the original/source tables, or you add new data to the cloned objects:

https://docs.snowflake.com/en/user-guide/tables-storage-considerations.html#cloning-tables-schemas-and-databases