I have created a number of small staging tables in Redshift as part of an ETL process. Each table has between 50 and 100 rows on average, with ~100 columns. When I query how much disk space each staging table requires, every column takes up exactly the same amount of space, and that amount is far in excess of what the data should need: for example, 6 MB for 59 BOOLEAN values. I have tried multiple permutations of the following (roughly sketched after the list):
- Column data types (VARCHAR, TIMESTAMP, etc.)
- Column encodings (LZO, BYTEDICT, etc.)
- Loading styles (individual INSERTs, deep copy, etc.)
- Repeated VACUUMs in between all the above steps
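For reference, the attempts looked roughly like this; the table name, column names, and the particular encodings shown are placeholders rather than my real schema:

-- Explicit column encodings on a small staging table
create table stage_example (
    flag boolean encode runlength
  , label varchar(64) encode bytedict
  , note varchar(256) encode lzo
  , loaded_at timestamp encode raw
);

-- Deep copy into a fresh table, then swap it in
create table stage_example_new (like stage_example);
insert into stage_example_new select * from stage_example;
drop table stage_example;
alter table stage_example_new rename to stage_example;

-- Reclaim space after each permutation
vacuum stage_example;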
Nothing seems to change the amount of space required for these staging tables. Why does Redshift not compress these tables more aggressively? Can I configure this behavior in Redshift, or should I simply force everything into one large staging table?
I'm using this query to determine disk space:
select name
, col
, sum(num_values) as num_values
, count(blocknum) as size_in_mb
from svv_diskusage
group by name
, col;
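When I drill into a single staging table I run a filtered, ordered variant of the same query, roughly like this (the table name here is just a placeholder):

select name
  , col
  , sum(num_values) as num_values
  , count(blocknum) as size_in_mb
from svv_diskusage
where name = 'stage_example'
group by name
  , col
order by size_in_mb desc;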