2
votes

This is a puzzling question for me, and I would like to understand why it happens.

I have two tables that are almost identical; the only differences are one column's data type and the sort key.

table                             mbytes    rows
stg_user_event_properties_hist    460948    2378751028
stg_user_event_properties_hist_1  246442    2513860837

Even though they have almost the same number of rows, the first table is close to double the size of the second.
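
For reference, figures like these can be pulled from the SVV_TABLE_INFO system view (size is reported in 1 MB blocks); the table names below are the ones from this question:

SELECT "table", size AS mbytes, tbl_rows
FROM svv_table_info
WHERE "table" IN ('stg_user_event_properties_hist', 'stg_user_event_properties_hist_1');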

Here are the table structures

CREATE TABLE stg.stg_user_event_properties_hist
(
id                bigint,
source            varchar(20),
time_of_txn       timestamp,
product           varchar(50),
region            varchar(50),
city              varchar(100),
state             varchar(100),
zip               varchar(10),
price             integer,
category          varchar(50),
model             varchar(50),
origin            varchar(50),
l_code            varchar(10),
d_name            varchar(100),
d_id              varchar(10),
medium            varchar(255),
network           varchar(255),
campaign          varchar(255),
creative          varchar(255),
event             varchar(255),
property_name     varchar(100),
property_value    varchar(4000),
source_file_name  varchar(255),
etl_batch_id      integer,
etl_row_id        integer,
load_date         timestamp       
);



CREATE TABLE stg.stg_user_event_properties_hist_1
(
id                bigint,
source            varchar(20),
time_of_txn       timestamp,
product           varchar(50),
region            varchar(50),
city              varchar(100),
state             varchar(100),
zip               varchar(10),
price             integer,
category          varchar(50),
model             varchar(50),
origin            varchar(50),
l_code            varchar(10),
d_name            varchar(100),
d_id              varchar(10),
medium            varchar(255),
network           varchar(255),
campaign          varchar(255),
creative          varchar(255),
event             varchar(255),
property_name     varchar(100),
property_value    varchar(4000),
source_file_name  varchar(255),
etl_batch_id      integer,
etl_row_id        varchar(20),
load_date         timestamp
);

To restate the differences: etl_row_id is varchar(20) in the _1 table and integer in the other, and the first table has a sort key on the source column.

What would be the explanation for the size difference?

UPDATE: The problem was both compression and sort keys. Even though the _1 table was created with CTAS, 11 of its 26 columns ended up with different compression settings, and the first table had also been created with a compound sort key of 14 columns. After recreating the table with no sort keys (it's a history table after all), the size went down to 231 GB.
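
For anyone hitting the same problem, here is a minimal sketch of the recreate-without-sort-key approach; the _nosort and _old suffixes are just placeholder names, and the DISTSTYLE and encodings should be adjusted to your own needs:

-- Copy the data into a new table with no sort key;
-- CTAS lets Redshift choose column encodings automatically
CREATE TABLE stg.stg_user_event_properties_hist_nosort
DISTSTYLE EVEN
AS SELECT * FROM stg.stg_user_event_properties_hist;

-- Swap the tables once the copy has been verified
ALTER TABLE stg.stg_user_event_properties_hist RENAME TO stg_user_event_properties_hist_old;
ALTER TABLE stg.stg_user_event_properties_hist_nosort RENAME TO stg_user_event_properties_hist;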

Comments:

Have you run VACUUM on both tables to remove deleted/updated rows? – John Rotenstein
Please VACUUM {{tablename}} TO 100 PERCENT, then also please show the compression for each column. – Jon Scott
Yes, both tables were vacuumed before I posted the question. – demircioglu
I wasn't able to locate where the column-level compression info is, any query would be helpful @JonScott. – demircioglu

3 Answers

2
votes

I suspect that the larger table has different compression settings or no compression at all. You can use our view v_generate_tbl_ddl to generate table DDL that includes the compression settings.
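
For example, assuming the view has been installed under an admin schema as in the amazon-redshift-utils AdminViews, you could pull the generated DDL (including the ENCODE settings) for both tables with something like:

SELECT ddl
FROM admin.v_generate_tbl_ddl
WHERE schemaname = 'stg'
  AND tablename IN ('stg_user_event_properties_hist', 'stg_user_event_properties_hist_1')
ORDER BY tablename, seq;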

Even with the same compression settings, table size can vary with different sort keys. The sort key is used to place the data into blocks on disk. If one sort key places lots of similar column values together, they will compress better and require less space.
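
You can see this directly by comparing how many 1 MB blocks each column occupies in the two tables, for example with the STV_BLOCKLIST and STV_TBL_PERM system tables (the table names are the ones from the question):

SELECT TRIM(p.name) AS table_name, b.col, COUNT(*) AS mb_blocks
FROM stv_blocklist b
JOIN stv_tbl_perm p ON b.tbl = p.id AND b.slice = p.slice
WHERE TRIM(p.name) IN ('stg_user_event_properties_hist', 'stg_user_event_properties_hist_1')
GROUP BY 1, 2
ORDER BY 1, 2;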

2
votes

The sizes are different for these two tables because one table is being allocated more blocks than the other, driven by its sort keys. For the bigger table, the data is laid out in such a way that the disk blocks are not fully occupied, so more blocks are needed to store the same amount of data.

This happens because of the 1 MB block size of Redshift and the way it stores data across slices and nodes. In general, data gets distributed across different nodes and slices based on the diststyle. For your case I am assuming this distribution is happening in a round-robin way, so slice 1 gets the first record, slice 2 gets the second record, and so on. As the minimum block size is 1 MB in Redshift, every time a new record goes to a new slice, 1 MB gets allocated (even if the record only takes a few KB). For subsequent records going to the same slice, data goes into the same 1 MB block while there is room, after which a new 1 MB block gets allocated on that slice. But if there are no more records after the first record for a slice, it still occupies a full 1 MB block. The total size of the table is the sum of all blocks occupied, irrespective of how much data is actually present in those blocks.
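
One way to check whether this is happening is to count the 1 MB blocks allocated per slice for each table, for example via the SVV_DISKUSAGE system view (table names taken from the question):

SELECT TRIM(name) AS table_name, slice, COUNT(*) AS mb_blocks
FROM svv_diskusage
WHERE TRIM(name) IN ('stg_user_event_properties_hist', 'stg_user_event_properties_hist_1')
GROUP BY 1, 2
ORDER BY 1, 2;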

0
votes

The difference in table size could be due to the following reasons (sample queries for each are shown after the list).

  • The encoding used for each column. (query PG_TABLE_DEF)
  • Distribution key used for the table. (query PG_TABLE_DEF)
  • Vacuum performed on the table. (query SVV_VACUUM_SUMMARY)
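
For example (the schema and table names are the ones from the question; note that PG_TABLE_DEF only lists tables in schemas on your search_path):

-- Column encodings, distribution key, and sort keys
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE schemaname = 'stg'
  AND tablename IN ('stg_user_event_properties_hist', 'stg_user_event_properties_hist_1');

-- Vacuum history for the two tables
SELECT *
FROM svv_vacuum_summary
WHERE table_name IN ('stg_user_event_properties_hist', 'stg_user_event_properties_hist_1');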

If I’ve made a bad assumption please comment and I’ll refocus my answer.