I started to develop data warehouse in telecommunication domain. I'm familiar with Kimball's approach (will model my DW as Star Schema). I expect many fact tables, the biggest could be up to tens of billions of rows. I can't imagine all types of queries that will be run again my data warehouse in a few years.
Modern db approaches says that computational capacity should be decoupled from storage, i.e. Azure Synapse Analytics, Redshift Spectrum, Presto, etc. In turn, such databases as Greenplum or Redshift use always running PostgreSQL servers to store data.
What could be the reasons to prefer RDBMs based data warehouse over "decoupled"?
While I totally like decoupled approach, my confusion is based on the fact that I have absolutely no idea of performance impact for analyzing data in S3/ADLS vs RDBMs:
- If you choose Redshift/Greenplum with inability to pause the cluster (and use serverless approach) you get performance optimization of RDBMs systems for lookups, aggregations and joins.
- If you choose MPP datawarehouse based on S3/ADLS you have run queries over cloud storages. That will give you ability to scale/pause cluster very quick, or even use serverless approach if cloud vendor allow that. But I can't find good documentation for Redshift/Azure Synapse Analytics of how they index data. Does they use parquet and indexes? Does they use custom column storage formats? Does they struggle from doing full scans on very selective queries?
P.S. If this kind of questions haven't become general theory yet and are very opinion based - should I make all decisions based on PoCs? I'm confused about this option, since DW PoC could take pretty much time. Maybe you know about any performance benchmarks?