I'm reading through the AWS Redshift documentation regarding compression types. In the section on BYTEDICT compression it says the following:
Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly recommend using a different encoding, such as LZO, for VARCHAR columns.
Assuming that "large VARCHAR columns" means "high cardinality," that recommendation makes sense. However, the last sentence seems to say one shouldn't bother using BYTEDICT with VARCHAR at all. That doesn't make sense to me, though. If you had millions of VARCHAR rows but the cardinality was low (e.g. Canadian provinces), wouldn't BYTEDICT be the best choice?
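For concreteness, this is the kind of table definition I have in mind (the table and column names here are made up purely for illustration):

    CREATE TABLE customer_addresses (
        customer_id BIGINT       ENCODE az64,
        province    VARCHAR(25)  ENCODE bytedict, -- low cardinality: only 13 provinces/territories
        street      VARCHAR(200) ENCODE lzo       -- high cardinality: keep LZO as the docs suggest
    );

As I understand it, with only 13 distinct values BYTEDICT would store each province as a one-byte dictionary index plus a small shared dictionary per block, which seems like it should be far cheaper than LZO-compressing the raw strings.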