
I have started developing a data warehouse in the telecommunications domain. I'm familiar with Kimball's approach (I will model my DW as a star schema). I expect many fact tables, the biggest of which could reach tens of billions of rows. I can't anticipate all the types of queries that will be run against my data warehouse over the next few years.

Modern DB approaches say that compute capacity should be decoupled from storage, e.g. Azure Synapse Analytics, Redshift Spectrum, Presto, etc. In contrast, databases such as Greenplum or Redshift use always-running PostgreSQL-based servers to store data.

What could be the reasons to prefer an RDBMS-based data warehouse over a "decoupled" one?

While I really like the decoupled approach, my confusion comes from the fact that I have absolutely no idea of the performance impact of analyzing data in S3/ADLS vs. an RDBMS:

  • If you choose Redshift/Greenplum, you cannot pause the cluster (or use a serverless approach), but you get the performance optimizations of an RDBMS for lookups, aggregations, and joins.
  • If you choose an MPP data warehouse based on S3/ADLS, you run queries over cloud storage. That gives you the ability to scale or pause the cluster very quickly, or even use a serverless approach if the cloud vendor allows it. But I can't find good documentation for Redshift/Azure Synapse Analytics on how they index data. Do they use Parquet and indexes? Do they use custom columnar storage formats? Do they struggle with full scans on very selective queries?

P.S. If this kind of question hasn't become settled theory yet and is very opinion-based, should I make all decisions based on PoCs? I'm hesitant about that option, since a DW PoC can take quite a lot of time. Do you know of any performance benchmarks?

You should also consider integration with your existing estate and your other requirements in detail. Bear in mind that every solution has pros and cons, and only a very few of them are addressed by the technical concerns listed in your question. Personally, I think that "modern DB approaches say that compute capacity should be decoupled from storage" should not be your main concern, or probably even in your top 10. – Jon Scott

1 Answer


But I can't find good documentation for [...] Azure Synapse Analytics on how they index data.

Azure Synapse Analytics supports both reading and writing to Azure Data Lake Store (ADLS), via what it calls "External Tables", and regular "Internal Tables", where it reads and writes to a distributed database stored on flash-based storage that is both highly compressed and heavily cached. Internal Tables are Clustered Columnstores by default, but can also be Heaps or Clustered Indexes, and you can add partitioning and secondary indexes if you need them. See CREATE TABLE.
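
For illustration, here is a minimal sketch of what those Internal Table options look like in a dedicated SQL pool (the table, columns, and partition boundaries are hypothetical):

    -- Hypothetical fact table stored as an Internal Table.
    -- CLUSTERED COLUMNSTORE INDEX is the default storage option;
    -- HEAP or CLUSTERED INDEX (col) could be specified instead.
    CREATE TABLE dbo.FactCallDetail
    (
        CallKey         bigint        NOT NULL,
        DateKey         int           NOT NULL,
        CustomerKey     int           NOT NULL,
        DurationSeconds int           NOT NULL,
        ChargeAmount    decimal(18,4) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH (CustomerKey),   -- rows spread across distributions by hash of CustomerKey
        CLUSTERED COLUMNSTORE INDEX,         -- default columnar storage
        PARTITION ( DateKey RANGE RIGHT FOR VALUES (20230101, 20240101) )
    );

    -- Optional secondary (nonclustered) index for very selective lookups.
    CREATE INDEX IX_FactCallDetail_CallKey
        ON dbo.FactCallDetail (CallKey);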

So ADLS forms the lowest-level, lowest-cost, slowest storage tier, and you can build tables that are stored in faster-to-access SQL Server internal storage.

So the Storage Hierarchy looks like this:

  • Compute Node RAM
  • Compute Node local NVMe Flash Storage
  • Database Files stored on Remote Distributed Flash Storage
  • Parquet files stored on Azure Data Lake Store

This is from the Architecture Overview doc: [architecture diagram]

will model my DW as Star Schema

So a common pattern with Synapse Analytics is to store the raw data as Parquet files in ADLS, and build the consumption-ready star schema as internal tables, typically with the fact tables distributed using hash or round-robin distribution and the dimensions stored as replicated tables, as sketched below.
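
A rough sketch of that pattern (all object names are hypothetical, and the external data source and file format referenced below would need to be created first with CREATE EXTERNAL DATA SOURCE and CREATE EXTERNAL FILE FORMAT):

    -- Hypothetical external table over raw Parquet files in ADLS.
    -- MyAdlsSource and ParquetFormat are assumed to exist already.
    CREATE EXTERNAL TABLE ext.RawCalls
    (
        CallId       bigint,
        CustomerId   int,
        CallStartUtc datetime2,
        DurationSec  int
    )
    WITH
    (
        LOCATION    = '/raw/calls/',
        DATA_SOURCE = MyAdlsSource,
        FILE_FORMAT = ParquetFormat
    );

    -- Consumption-ready fact table: hash-distributed Internal Table,
    -- loaded from the external (ADLS) table with CTAS.
    CREATE TABLE dbo.FactCall
    WITH
    (
        DISTRIBUTION = HASH (CustomerId),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT CallId, CustomerId, CallStartUtc, DurationSec
    FROM ext.RawCalls;

    -- Small dimension: replicated to every compute node so joins
    -- against the fact table avoid data movement.
    CREATE TABLE dbo.DimCustomer
    (
        CustomerId   int           NOT NULL,
        CustomerName nvarchar(200) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,
        CLUSTERED COLUMNSTORE INDEX
    );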