Redshift table size

Question

This is more like a puzzling question for me and would like to understand why.

I have two tables, almost identical the only differences are one column's data type and sortkey.

table                             mbytes    rows
stg_user_event_properties_hist    460948    2378751028
stg_user_event_properties_hist_1  246442    2513860837

Even though they have almost same number of rows, size is close to double.

Here are the table structures

stg.stg_user_event_properties_hist
(
id                bigint,
source            varchar(20),
time_of_txn       timestamp,
product           varchar(50),
region            varchar(50),
city              varchar(100),
state             varchar(100),
zip               varchar(10),
price             integer,
category          varchar(50),
model             varchar(50),
origin            varchar(50),
l_code            varchar(10),
d_name            varchar(100),
d_id              varchar(10),
medium            varchar(255),
network           varchar(255),
campaign          varchar(255),
creative          varchar(255),
event             varchar(255),
property_name     varchar(100),
property_value    varchar(4000),
source_file_name  varchar(255),
etl_batch_id      integer,
etl_row_id        integer,
load_date         timestamp       
);



stg.stg_user_event_properties_hist_1
(
id                bigint,
source            varchar(20),
time_of_txn       timestamp,
product           varchar(50),
region            varchar(50),
city              varchar(100),
state             varchar(100),
zip               varchar(10),
price             integer,
category          varchar(50),
model             varchar(50),
origin            varchar(50),
l_code            varchar(10),
d_name            varchar(100),
d_id              varchar(10),
medium            varchar(255),
network           varchar(255),
campaign          varchar(255),
creative          varchar(255),
event             varchar(255),
property_name     varchar(100),
property_value    varchar(4000),
source_file_name  varchar(255),
etl_batch_id      integer,
etl_row_id        varchar(20),
load_date         timestamp
);

The differences again etl_row_id has data type varchar(20) in _1, integer in the other table, and the first table has a sortkey on source column.

What would be the explanation for the size difference?

UPDATE: The problem was both compression and sort keys, even though _1 table created with CTAS 11 of 26 had different compression settings, also the first table was created with Compound SortKey of 14 columns, recreated the table with no sort keys (it's a history table after all) size went down to 231GB.

Have you run VACUUM on both tables to remove deleted/updated rows? — John Rotenstein
please vacuum {{tablename}} to 100 percent. then also please show the compression for each column. — Jon Scott
I wasn't able to locate where column level compression info is, any query would be helpful @JonScott — demircioglu

Joe Harris Joe Harris · Accepted Answer · 2018-08-03T14:48:10

Suspect that the larger table has different compression settings or no compression at all. You can use our view v_generate_tbl_ddl to generate table DDL that includes the compression settings.

Even with the same compression settings table size can vary with different sort keys. The sort key is use to place the data into blocks on disk. If one sort key places lots of similar column values together it will compress better and require less space.

Redshift table size

3 Answers