0
votes

I have created a number of small staging tables in Redshift as part of an ETL process. Each table has between 50 and 100 rows on average, with ~100 columns. When I query to see how much disk space each staging table requires, every column takes up exactly the same amount of space, far in excess of what the data needs: for example, 6 MB for 59 BOOLEAN values. I have tried multiple permutations of the following (a sketch of the sort of DDL I ran follows the list):

  • Column data types (varchar, timestamp, etc)
  • Column encodings (lzo, bytedict, etc)
  • Loading styles (individual insert, deep copy, etc)
  • Repeated VACUUMs in between all the above steps
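
For reference, this is the sort of thing I tried (table and column names here are made up; stg_example is just a stand-in for one of my staging tables):

    -- explicit column encodings on a hypothetical staging table
    create table stg_example (
        id        integer encode lzo,
        flag      boolean encode raw,
        label     varchar(32) encode bytedict,
        loaded_at timestamp encode lzo
    );

    -- deep copy into a fresh table to force re-encoding
    create table stg_example_new (like stg_example);
    insert into stg_example_new select * from stg_example;
    drop table stg_example;
    alter table stg_example_new rename to stg_example;

    vacuum stg_example;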

Nothing seems to change the amount of space required for these staging tables. Why does Redshift not compress these tables more aggressively? Can I configure this behaviour in Redshift? Or should I simply force everything into one large staging table?

I'm using this query to determine disk space:

select name
    , col
    , sum(num_values) as num_values
    -- each block is 1 MB, so a count of blocks is a size in MB
    , count(blocknum) as size_in_mb
from svv_diskusage
group by name
    , col;
I think I'm realizing my mistake here: since the minimum block size for a column is 1 MB, the minimum size for my table must be 1 MB * (number of columns). I was able to reduce the size by choosing a constant value for DISTKEY, but it's not a massive reduction. - Conor
Yup, that's the reason. It will take 1 MB for a column even if only one record lands on that node. (So if you have a lot of nodes, your table size will be bloated until you reach a threshold, i.e. until the 1 MB blocks fill up.) - Paladin

2 Answers

0
votes

Since the block size in Redshift is 1 MB, each column takes up at least 1 MB on every slice that holds data for it. On top of this, if the DISTSTYLE is EVEN, the rows are spread across all slices, so it will be closer to one block per slice for every column. Since there is no way to tweak the block size in Redshift, there is no way to reduce the size of a nearly empty table below (number of columns) * (slices containing data for each column) * 1 MB.
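
A quick way to see this (a sketch; it just adds svv_diskusage's slice column to the query in the question) is to count blocks per column per slice. For the 6 MB boolean column in the question, this should show six populated slices, each holding one 1 MB block:

    select name
        , col
        , slice
        , count(blocknum) as blocks
    from svv_diskusage
    group by name
        , col
        , slice
    order by name
        , col
        , slice;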

0
votes

It's basically:

For tables created using the KEY or EVEN distribution style:

Minimum table size = block_size (1 MB) * (number_of_user_columns + 3 system columns) * number_of_populated_slices * number_of_table_segments

For tables created using the ALL distribution style:

Minimum table size = block_size (1 MB) * (number_of_user_columns + 3 system columns) * number_of_cluster_nodes * number_of_table_segments

number_of_table_segments is 1 for an unsorted table and 2 for a table defined with a sort key.
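
As a rough sanity check against the numbers in the question (this assumes the ~100 rows landed on 6 populated slices, which is what 6 MB for a single column implies): an unsorted, EVEN-distributed staging table with 100 user columns has a minimum size of 1 MB * (100 + 3) * 6 * 1 = 618 MB. With ALL distribution on a 2-node cluster, the same table would occupy at least 1 MB * (100 + 3) * 2 * 1 = 206 MB per the second formula.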