0 votes

Going to do a POC on Snowflake and just wanted to check the best practice around loading data into Snowflake:

  1. Should I load the data in normalized form (group related information into multiple tables), or go with a denormalized form? What is recommended here?
  2. Or should I dump the data into one table and create multiple views from it? Consider that the big table has 150 million records and a column called Australia State, and we know there are only 6 states in Australia. If I create a view to extract the Australia State information from the main table, I feel that will be more costly than storing the Australia State information in a separate table — and that is what I mean by normalization.
  3. What is the way to load SCD-2 dimensions in Snowflake? I'm interested in an efficient way to do this.

2 Answers

1 vote

Your questions 1 and 2 seem to be more about partitioning ("clustering" in Snowflake lingo) than normalization. It is also about performance vs. maintainability.

The best of both worlds would be a single table where Australia State is a clustering key. A correct setup will allow for efficient query pruning. Read more in Clustering Keys & Clustered Tables.
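As a minimal sketch of that setup — the table and column names here are hypothetical stand-ins for your data:

```sql
-- Single wide table; the low-cardinality state column (only 6 distinct
-- values) makes a reasonable clustering key candidate.
CREATE TABLE customer_events (
  event_id    NUMBER,
  aus_state   VARCHAR,   -- e.g. 'NSW', 'VIC', 'QLD', ...
  event_ts    TIMESTAMP_NTZ,
  payload     VARIANT
)
CLUSTER BY (aus_state);

-- Queries that filter on the clustering key can prune micro-partitions,
-- so Snowflake scans only the partitions containing the requested state:
SELECT COUNT(*)
FROM customer_events
WHERE aus_state = 'NSW';
```

Whether clustering pays off at 150 million rows depends on your query patterns and data churn; `SYSTEM$CLUSTERING_INFORMATION` can help you check how well the table is clustered.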

Re. question 3: look into MERGE. You may also get some hints from reading Working with SCD-Type-II in Snowflake.
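A common two-step pattern is to close out changed rows with MERGE, then insert the new current versions. This is only a sketch with hypothetical table and column names (a staging snapshot `stg_customer` feeding an SCD-2 dimension `dim_customer`), not the only way to do it:

```sql
-- Step 1: expire the current row for any customer whose tracked
-- attribute changed in the latest staging snapshot.
MERGE INTO dim_customer d
USING stg_customer s
  ON  d.customer_id = s.customer_id
  AND d.is_current  = TRUE
WHEN MATCHED AND d.aus_state <> s.aus_state THEN UPDATE SET
  d.is_current = FALSE,
  d.valid_to   = CURRENT_TIMESTAMP();

-- Step 2: insert a new current version for every customer that now
-- has no current row (brand-new customers, plus those expired above).
INSERT INTO dim_customer (customer_id, aus_state, valid_from, valid_to, is_current)
SELECT s.customer_id, s.aus_state, CURRENT_TIMESTAMP(), NULL, TRUE
FROM stg_customer s
LEFT JOIN dim_customer d
  ON  d.customer_id = s.customer_id
  AND d.is_current  = TRUE
WHERE d.customer_id IS NULL;
```

Wrapping both statements in one transaction keeps the dimension consistent if the load fails partway through.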

0 votes

I would load the data the way that makes the most sense for how it will be updated and used.

By which I mean: we have data (many forms, actually) that we sync/stream from PostgreSQL DBs, and some of it we dimension (SCD1/SCD2/SCD6) as we load it. For this data we have the update timestamp; as we load each record, we work out the changes and build the dimension data.

If you already have dimension data and it's a one-off data move, dump the tables you have and just load them. It's really cheap to make a new table in Snowflake, so we just tried things out and worked out what fitted our data-ingress patterns and how we were reading the data, to improve clustering or to avoid churn that drives up the cost of auto-clustering operations.