
I've started learning AWS Redshift, and I'm coming across many things that I believe are not in favor of a data warehouse star/snowflake schema.

Based on user responses, everyone recommends an insert-only approach in Redshift for best performance, because it is designed for reads. But doesn't that increase the cost of storage? I currently work on MSBI, and my fact and dimension tables have a complex structure. For example: one fact table is shared across various business areas (data marts), a few dimensions are Type 2 (where I have to track history) and a few are not, and a few complex scenarios need a snowflake design.

Considering the cost of storage and computation in the cloud, I would like to keep minimal data in the cloud (the same as I do in my on-premises system, which amounts to 4 TB of storage).

Now, if I follow the same approach I've been using on premises, I would have to run my ETL, compare key columns with staging, and then perform CRUD operations, which makes it pointless to move my existing system to the cloud. If I go with a flat table structure instead, I will end up with 4-6 times more data in my table, which will increase the cost of storage in the cloud, and the computation on top of it may cost extra.
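For context, here is a minimal sketch of the kind of staged merge I mean (table and column names are hypothetical); the usual Redshift pattern for this is delete-then-insert inside a transaction:

```sql
-- Hypothetical staged merge: compare staging against the target on the
-- business key, delete the rows that have a newer version in staging,
-- then insert the refreshed rows from staging.
BEGIN TRANSACTION;

DELETE FROM dim_customer
USING stg_customer
WHERE dim_customer.customer_key = stg_customer.customer_key;

INSERT INTO dim_customer
SELECT * FROM stg_customer;

END TRANSACTION;
```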

How to handle Slowly Changing Dimension Type 2 in Redshift?
Redshift Performance of Flat Tables Vs Dimension and Facts

Answers to the above questions talk about how flat tables can be a better fit for Redshift.

https://aws.amazon.com/blogs/big-data/optimizing-for-star-schemas-and-interleaved-sorting-on-amazon-redshift/

But the Redshift blog above talks about how a star schema can be optimized:

Star and snowflake schemas run well on Amazon Redshift, and the addition of interleaved sort keys further enhances performance by reducing I/O for a wider range of filter predicates on a table when needed.

Now, if I choose the insert-only approach (which complements the Redshift architecture), I end up paying more for storage. And if I choose a traditional data warehouse design, I end up paying extra for computation.

Is there any real-world example you can share that would help me understand the approach you have followed in Redshift?


1 Answer


In my experience Redshift handles flat tables well, and compression eliminates a lot of the storage overhead. For my use cases, though, the primary concern was keeping ETL as simple as possible.

Redshift will almost always suggest ZSTD compression, but for some dimensions you can get better compression with BYTEDICT when you know there will be few distinct values.
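As an illustration, the encoding choice on a hypothetical dimension might look like this (table and column names are made up):

```sql
-- Hypothetical dimension: low-cardinality columns get BYTEDICT
-- (best when a column has at most a few hundred distinct values),
-- everything else falls back to ZSTD, Redshift's usual suggestion.
CREATE TABLE dim_store (
    store_key     BIGINT        ENCODE zstd,
    store_name    VARCHAR(200)  ENCODE zstd,
    country_code  CHAR(2)       ENCODE bytedict,
    store_type    VARCHAR(20)   ENCODE bytedict
);
```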

With a good sort key and a distribution key that supports your aggregation patterns, you can use the full power of your cluster when querying a flat table and not be limited by bandwidth. The same goes for a star schema with distributed dimension tables, of course, but there's always that one dimension which isn't small enough to be replicated to every node and where the FK isn't a good fit as a distribution key.
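As a rough sketch of what I mean, assuming a made-up flat sales table:

```sql
-- Hypothetical flat table: distribute on the column most joins and
-- aggregations use, sort on the column most queries filter on.
CREATE TABLE sales_flat (
    sale_date    DATE,
    customer_id  BIGINT,
    region       VARCHAR(50),
    product_name VARCHAR(200),
    amount       DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);

-- In a star schema, a small dimension could instead be replicated to every
-- node with DISTSTYLE ALL, e.g. CREATE TABLE dim_region (...) DISTSTYLE ALL;
```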


Before you dive too deep into Redshift, also consider whether Athena could be a solution for you. Using S3 for storage is far cheaper than Redshift disks, and performance is comparable for many use cases. There's also the hybrid option, Redshift Spectrum, where you could offload old partitions to S3 and only keep recent partitions in a smaller cluster.
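To give an idea of the Spectrum hybrid, a rough sketch (the schema, database, and table names and the IAM role ARN are all placeholders):

```sql
-- Hypothetical setup: expose S3-resident history through an external schema
-- backed by the Glue Data Catalog, keep recent data in a local table, and
-- UNION the two when a query needs full history.
CREATE EXTERNAL SCHEMA sales_history
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

SELECT sale_date, SUM(amount) AS total_amount
FROM (
    SELECT sale_date, amount FROM sales_flat              -- local, recent
    UNION ALL
    SELECT sale_date, amount FROM sales_history.sales_old -- external, on S3
) AS all_sales
GROUP BY sale_date;
```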