1 vote

Production Cluster details:

  • Node Type dc1.8xlarge
  • Nodes 25
  • 2.56TB SSD storage per node

Test Cluster details:

  • Node Type ds2.xlarge
  • Nodes 6
  • 2TB HDD storage per node

When the same table with exactly the same DDL and encodings is unloaded from the production cluster and copied to the test cluster, its disk footprint shrinks dramatically (by roughly two orders of magnitude). This has been tested with multiple tables with different distribution styles and sort key patterns.

Example: Table A (no sort key, DISTSTYLE EVEN) - Size in production: 60 GB; Size in test: 0.6 GB

Table B (sort key, DISTSTYLE KEY) - Size in production: 96 GB (100% sorted); Size in test: 1.4 GB (100% sorted)

Any ideas what could cause this discrepancy? I have read through most of the Redshift forums but have not been able to find a reason for this issue. I am using the admin view v_space_used_per_tbl (provided by AWS) to calculate the table sizes.
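
For reference, a quick way to cross-check the admin view is to query SVV_TABLE_INFO directly; its size column reports the number of 1 MB blocks the table occupies. This is only a sketch, and 'table_a' is a placeholder for the actual table name:

    -- Cross-check table size (in 1 MB blocks) and row count on each cluster.
    -- 'table_a' is a placeholder table name.
    SELECT "table", diststyle, tbl_rows, size AS size_mb
    FROM svv_table_info
    WHERE "table" = 'table_a';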

1 - Please do not crosspost. (The question is better suited for dba.stackexchange.com) - a_horse_with_no_name

1 Answer

2 votes

If the table's row count is small, its size is determined by the minimum table size for your cluster.

Redshift allocates a minimum of 1 MB per column per slice. A dc1.8xlarge node has 32 slices, so on a 25-node cluster that is 800 slices, i.e. a minimum of ~800 MB per column.

A table that hits a 60 GB minimum size therefore has roughly 72 user columns plus the 3 internal columns Redshift adds (fewer if some of the columns are wide VARCHARs). A ds2.xlarge node has only 2 slices, so on a 6-node cluster (12 slices) the minimum size of that same table is ~900 MB, which is in line with what you see on the test cluster.
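
If you want to confirm the 1 MB per column per slice allocation on your own cluster, one option (a sketch; 'table_a' is a placeholder name) is to count the blocks recorded in SVV_DISKUSAGE, where each row represents one allocated 1 MB block:

    -- Count allocated 1 MB blocks per column for one table, and the number
    -- of slices that hold at least one block for that column.
    -- 'table_a' is a placeholder table name.
    SELECT col,
           COUNT(*)              AS total_blocks_mb,
           COUNT(DISTINCT slice) AS slices_with_blocks
    FROM svv_diskusage
    WHERE name = 'table_a'
    GROUP BY col
    ORDER BY col;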

For the large cluster in particular, you should consider using DISTSTYLE ALL if the table is a "dimension" table (a small-ish lookup table). That reduces the minimum size to 1 MB per column per node, because the table is stored as one full copy per node rather than being spread across every slice. A sketch of the DDL follows this paragraph.
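
As an illustration (the table and column names here are made up), the distribution style is declared in the CREATE TABLE statement:

    -- Hypothetical small dimension table. DISTSTYLE ALL keeps one full copy
    -- on each node, so the per-slice minimum allocation no longer multiplies
    -- across all slices in the cluster.
    CREATE TABLE dim_example (
        dim_id   INTEGER NOT NULL,
        dim_name VARCHAR(64)
    )
    DISTSTYLE ALL;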

For a detailed explanation, please refer to the AWS Knowledge Center article "Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?"